Exploratory Data Analysis — Profiling & Diagnostics (pandas)
Learn practical EDA profiling: structural checks, missingness, type quality, descriptive stats, correlations, duplicates, leakage, memory, and automated profiling tools — with runnable examples and quizzes.
Contents
What is Profiling? #
shape & schema
missingness
distribution & outliers
correlations
Profiling is a systematic first pass over a dataset to understand its structure, quality, and statistical properties. You’ll usually collect:
- Shape, column names, dtypes, memory usage
- Missingness (counts & percentages), constant/unique values
- Descriptive stats for numeric/categorical features
- Outlier heuristics (z-score, IQR), invalid values/ranges
- Correlations/associations (Pearson/Spearman, Cramér’s V)
- Duplicates, candidate keys, potential leakage
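The outlier heuristics in the list above (z-score, IQR) can be sketched in a few lines. This is a minimal illustration on a made-up series, not part of the lesson's toy dataset:

```python
import pandas as pd

# Hypothetical series for illustration; 950 is an obvious outlier
s = pd.Series([120, 90, 200, 75, 75, 950])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: flag points more than 3 standard deviations from the mean.
# On tiny samples this can miss even glaring outliers, because the outlier
# itself inflates the standard deviation.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

print("IQR outliers:", iqr_outliers.tolist())      # -> [950]
print("z-score outliers:", z_outliers.tolist())    # -> [] on this sample
```

Both are heuristics: thresholds like 1.5×IQR or |z| > 3 are conventions, and flagged points still need a domain-level sanity check before you drop or cap them.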
Quiz: Profiling Goals
1. Best description of EDA profiling is:
Quick Scan: structure & quality #
df.head/info
dtypes
isna/unique
import pandas as pd
df = pd.DataFrame({
    "order_id": ["A","B","C","D","E","F"],
    "amount": [120, 90, 200, None, 75, 75],
    "city": ["Austin","Austin","Boston","Chicago","Chicago","Chicago"],
    "signup": ["2024-02-01","2024-02-08","2024-02-08","2024-02-09",None,"2024-02-11"],
    "vip": [True, False, False, False, True, False]
})
# Parse dates
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
print("Shape:", df.shape)
df.info() # dtypes, non-null counts, memory
print("\nNumeric describe:\n", df.describe()) # numeric columns
print("\nCategorical describe:\n", df.describe(include="object"))
print("\nMissing per column:\n", df.isna().sum())
print("\nUnique counts:\n", df.nunique(dropna=True))
Quiz: Structure
2. df.info() provides… (select all)
Descriptive Statistics #
central tendency
dispersion
categorical summaries
# Numeric summary
num_summary = df.describe() # count, mean, std, min, quartiles, max
# Include categorical
cat_summary = df.describe(include=["object","bool","category"])
# Custom percentiles
df.describe(percentiles=[.01,.05,.95,.99])
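Whole-table descriptives can hide segment-level quirks, so a grouped summary per category is often a useful companion. A minimal sketch, re-creating the relevant toy columns so it runs standalone:

```python
import pandas as pd

# Same toy columns as the lesson's quick-scan frame (re-created here)
df = pd.DataFrame({
    "amount": [120, 90, 200, None, 75, 75],
    "city": ["Austin","Austin","Boston","Chicago","Chicago","Chicago"],
})

# Per-group descriptives: one row per city; count ignores NaN
by_city = df.groupby("city")["amount"].agg(["count", "mean", "min", "max"])
print(by_city)
```

Comparing `count` across groups also surfaces where missingness concentrates (here Chicago has one null amount).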
Quiz: Descriptives
3. Which call yields numeric summary statistics by default?
Missingness #
MCAR/MAR/MNAR
percent missing
null patterns
# Column-level missingness % (0..1)
miss_pct = df.isna().mean().sort_values(ascending=False)
print(miss_pct)
# Row-level missingness counts
row_miss = df.isna().sum(axis=1)
print(row_miss.value_counts().sort_index())
# Null co-occurrence (simple)
print(df[["amount","signup"]].isna().mean())
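Beyond per-column percentages, it can help to tally *which combinations* of columns are missing together, since co-missing columns hint at a shared cause (e.g. one upstream source failing). A small sketch on a hypothetical frame, encoding each row's null pattern as a string of 0s and 1s:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with overlapping nulls (illustration only)
df = pd.DataFrame({
    "amount": [1.0, np.nan, 3.0, np.nan],
    "signup": ["a", None, "c", "d"],
})

# Encode each row's null pattern ("1" = missing) and count occurrences:
# e.g. "11" means both columns missing in the same row
patterns = df.isna().astype(int).astype(str).agg("".join, axis=1).value_counts()
print(patterns)
```

Here `"11"` rows (both columns null together) would suggest the two fields come from the same broken source, whereas independent `"10"`/`"01"` patterns point to separate causes.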
Quiz: Missingness
4. Best one-liner for % missing per column:
Correlation & Associations #
Pearson/Spearman
Cramér’s V (cats)
target leakage checks
# Numeric-numeric correlations
pearson = df.corr(numeric_only=True) # linear association
spearman = df.corr(method="spearman", numeric_only=True) # rank-based
print("Pearson:\n", pearson)
print("\nSpearman:\n", spearman)
# Categorical-categorical association (Cramér's V)
# Example: city vs vip
import numpy as np
import scipy.stats as st
ct = pd.crosstab(df["city"], df["vip"])
chi2 = st.chi2_contingency(ct)[0]
n = ct.values.sum()
phi2 = chi2 / n
r, k = ct.shape
phi2corr = max(0, phi2 - (k-1)*(r-1)/(n-1))
rcorr = r - (r-1)**2/(n-1)
kcorr = k - (k-1)**2/(n-1)
cramers_v = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
print("\nCramér's V city~vip:", round(cramers_v, 4))
Quiz: Associations
5. Which statement is true?
6. Cramér’s V is mainly used for:
Duplicates & Leakage #
duplicated
candidate keys
train/test leakage
# Count duplicate rows
print("Duplicate rows:", int(df.duplicated().sum()))
# Check candidate key (expect unique IDs)
print("order_id unique? ->", df["order_id"].is_unique)
# Simple leakage heuristic: features that are near-perfectly correlated with target
# (Example only; in real settings inspect business meaning)
# target = ...
# leak_corr = df.corr(numeric_only=True)[target].abs().sort_values(ascending=False).head(10)
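The commented heuristic above can be made concrete on synthetic data. The column names (`refund_amount` as a post-outcome leak) are hypothetical, chosen only to illustrate the pattern:

```python
import pandas as pd
import numpy as np

# Synthetic frame: "refund_amount" is decided AFTER the outcome, so it
# near-perfectly encodes the target — a classic leak (hypothetical names)
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, n)
df = pd.DataFrame({
    "target": target,
    "refund_amount": target * 50 + rng.normal(0, 1, n),  # leaky feature
    "amount": rng.normal(120, 30, n),                    # legitimate feature
})

# Rank features by |correlation| with the target; suspiciously high values
# warrant a business-meaning review before modelling
leak_corr = (df.corr(numeric_only=True)["target"]
               .drop("target").abs().sort_values(ascending=False))
print(leak_corr)
```

A near-1.0 correlation is not proof of leakage by itself, but it is exactly the kind of signal the code comment above tells you to inspect against business meaning.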
Quiz: Duplicates & Leakage
7. How do you count duplicate rows?
8. Which is an example of target leakage?
Memory & Types #
downcast
category dtype
datetime parsing
# Downcast integers/floats
opt = df.copy()
for col in opt.select_dtypes(include="integer").columns:
    opt[col] = pd.to_numeric(opt[col], downcast="integer")
for col in opt.select_dtypes(include="floating").columns:
    opt[col] = pd.to_numeric(opt[col], downcast="float")
# Category for low-cardinality strings
if "city" in opt:
    opt["city"] = opt["city"].astype("category")
print("Before MB:", round(df.memory_usage(deep=True).sum()/1e6, 4))
print("After MB:", round(opt.memory_usage(deep=True).sum()/1e6, 4))
Quiz: Memory & Types
9. Useful memory/type steps (select all):
Automated Profilers #
ydata-profiling
Sweetviz
privacy & performance
ydata-profiling (formerly pandas-profiling)
# In a local environment (not in-browser):
# pip install ydata-profiling
from ydata_profiling import ProfileReport
profile = ProfileReport(
    df,
    title="EDA Profile",
    explorative=True,  # extra correlations, warnings, duplicates…
    minimal=False,
)
profile.to_file("profile.html") # open this HTML report in your browser
Tips: sample large datasets first (df.sample(50000, random_state=0)), remove PII columns, and consider minimal=True for speed.
Sweetviz (beautiful exploratory reports)
# pip install sweetviz
import sweetviz as sv
report = sv.analyze(df)
report.show_html("sweetviz_report.html")
Quiz: Automated Profilers
10. Good cautions when generating auto reports (select all):
Final Quiz & Summary #
Review your score and identify sections to revisit. You can reset locally and try again.
Category: python · Lesson: eda-profiling
Try it yourself (runs in your browser)
Pyodide will load Python & pandas on first run. Edit and run.
import pandas as pd
import numpy as np
# Toy dataset with types, missingness, categories
rng = np.random.default_rng(7)
n = 24
df = pd.DataFrame({
    "order_id": [chr(65+i) for i in range(n)],
    "amount": rng.normal(120, 30, n).round(2),
    "discount": rng.choice([0, 5, 10, np.nan], size=n, p=[.5,.2,.2,.1]),
    "city": rng.choice(["Austin","Boston","Chicago","Denver"], size=n, p=[.3,.25,.3,.15]),
    "signup": pd.to_datetime("2024-02-01") + pd.to_timedelta(rng.integers(0, 20, n), unit="D"),
    "vip": rng.choice([True, False], size=n, p=[.3,.7]),
})
# Inject a few nulls and dup row
df.loc[5, "amount"] = np.nan
df.loc[10, "city"] = None
df = pd.concat([df, df.iloc[[3]]], ignore_index=True) # single duplicate row
print("=== HEAD ===")
print(df.head(), end="\n\n")
print("=== INFO ===")
df.info()
print()
print("=== DESCRIBE (numeric) ===")
print(df.describe(), end="\n\n")
print("=== DESCRIBE (cats/bools) ===")
print(df.describe(include=["object","bool","category"]), end="\n\n")
print("=== MISSINGNESS (column %) ===")
print((df.isna().mean()*100).round(2).sort_values(ascending=False), end="\n\n")
print("=== DUPLICATES ===")
print("duplicate rows:", int(df.duplicated().sum()))
print("order_id unique?", df["order_id"].is_unique, end="\n\n")
print("=== CORRELATIONS (Pearson) ===")
print(df.corr(numeric_only=True))