Exploratory Data Analysis — Profiling & Diagnostics (pandas)

Learn practical EDA profiling: structural checks, missingness, type quality, descriptive stats, correlations, duplicates, leakage, memory, and automated profiling tools — with runnable examples and quizzes.

What is Profiling?

shape & schema · missingness · distribution & outliers · correlations

Profiling is a systematic first pass over a dataset to understand its structure, quality, and statistical properties. You'll usually collect: the shape and schema, missingness rates and patterns, descriptive statistics and outlier signals, and correlations or associations between columns.
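
A minimal sketch of such a first pass, bundling those checks into one helper (the function name quick_profile is ours, not a pandas API):

import pandas as pd

def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, missingness %, cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })

Sorting the result by missing_pct or n_unique quickly surfaces problem columns.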

Quiz: Profiling Goals

1. Which best describes EDA profiling?

Quick Scan: structure & quality

df.head/info · dtypes · isna/unique
import pandas as pd

df = pd.DataFrame({
    "order_id": ["A","B","C","D","E","F"],
    "amount":   [120,  90,  200, None, 75,  75],
    "city":     ["Austin","Austin","Boston","Chicago","Chicago","Chicago"],
    "signup":   ["2024-02-01","2024-02-08","2024-02-08","2024-02-09",None,"2024-02-11"],
    "vip":      [True, False, False, False, True, False]
})

# Parse dates; unparseable or missing values become NaT
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")

print("Shape:", df.shape)
df.info()                        # dtypes, non-null counts, memory
print("\nNumeric describe:\n", df.describe())      # numeric columns
print("\nCategorical describe:\n", df.describe(include="object"))
print("\nMissing per column:\n", df.isna().sum())
print("\nUnique counts:\n", df.nunique(dropna=True))

Quiz: Structure

2. df.info() provides… (select all)

Descriptive Statistics

central tendency · dispersion · categorical summaries
# Numeric summary
num_summary = df.describe()  # count, mean, std, min, quartiles, max

# Include categorical
cat_summary = df.describe(include=["object","bool","category"])

# Custom percentiles to inspect the tails
print(df.describe(percentiles=[.01, .05, .95, .99]))

Quiz: Descriptives

3. Which call yields numeric summary statistics by default?

Missingness

MCAR/MAR/MNAR · percent missing · null patterns
# Column-level missingness % (0..1)
miss_pct = df.isna().mean().sort_values(ascending=False)
print(miss_pct)

# Row-level missingness counts
row_miss = df.isna().sum(axis=1)
print(row_miss.value_counts().sort_index())

# Null co-occurrence: count rows by their (amount, signup) missingness pattern
print(df[["amount","signup"]].isna().value_counts())
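
Whether missingness is MCAR, MAR, or MNAR is ultimately a judgment about the data-generating process, but an informal probe is to compare other columns across missing vs. present rows; a minimal sketch:

# Does amount's missingness co-vary with city? (informal MAR probe)
print(df.groupby(df["amount"].isna())["city"].value_counts(normalize=True))

If the city mix differs sharply between the two groups, the missingness is unlikely to be completely at random.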

Quiz: Missingness

4. What's the best one-liner for % missing per column?

Correlation & Associations

Pearson/Spearman · Cramér's V (cats) · target leakage checks
# Numeric-numeric correlations
pearson = df.corr(numeric_only=True)              # linear association
spearman = df.corr(method="spearman", numeric_only=True)  # rank-based
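# Note: numeric_only=True also keeps bool columns, so vip enters these matrices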

print("Pearson:\n", pearson)
print("\nSpearman:\n", spearman)

# Categorical-categorical association (Cramér's V)
# Example: city vs vip
import numpy as np
import scipy.stats as st

ct = pd.crosstab(df["city"], df["vip"])
chi2 = st.chi2_contingency(ct)[0]
n = ct.values.sum()
phi2 = chi2 / n
r, k = ct.shape
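# Bias correction (Bergsma, 2013) guards against inflated V in small samples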
phi2corr = max(0, phi2 - (k-1)*(r-1)/(n-1))
rcorr = r - (r-1)**2/(n-1)
kcorr = k - (k-1)**2/(n-1)
cramers_v = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
print("\nCramér's V city~vip:", round(cramers_v, 4))

Quiz: Associations

5. Which statement is true?
6. Cramér’s V is mainly used for:

Duplicates & Leakage

duplicated · candidate keys · train/test leakage
# Count duplicate rows
print("Duplicate rows:", int(df.duplicated().sum()))

# Check candidate key (expect unique IDs)
print("order_id unique? ->", df["order_id"].is_unique)

# Simple leakage heuristic: features that are near-perfectly correlated with target
# (Example only; in real settings inspect business meaning)
# target = ...
# leak_corr = df.corr(numeric_only=True)[target].abs().sort_values(ascending=False).head(10)
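
To make that heuristic concrete, here is a runnable sketch; the target column y is fabricated from amount purely for illustration (a near-copy of a feature, which is exactly what a leakage check should flag):

train = df.assign(y=df["amount"] * 1.001)    # hypothetical target for illustration only
leak_corr = train.corr(numeric_only=True)["y"].abs().sort_values(ascending=False)
print(leak_corr.head(10))                    # amount shows |corr| == 1.0 -> investigate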

Quiz: Duplicates & Leakage

7. How do you count duplicate rows?
8. Which is an example of target leakage?

Memory & Types

downcast · category dtype · datetime parsing
# Downcast integers/floats (this toy frame has no integer columns, so the first loop is a no-op here)
opt = df.copy()
for col in opt.select_dtypes(include="integer").columns:
    opt[col] = pd.to_numeric(opt[col], downcast="integer")
for col in opt.select_dtypes(include="floating").columns:
    opt[col] = pd.to_numeric(opt[col], downcast="float")

# Category for low-cardinality strings
if "city" in opt:
    opt["city"] = opt["city"].astype("category")

print("Before MB:", round(df.memory_usage(deep=True).sum()/1e6, 4))
print("After  MB:", round(opt.memory_usage(deep=True).sum()/1e6, 4))

Quiz: Memory & Types

9. Useful memory/type steps (select all):

Automated Profilers

ydata-profiling · Sweetviz · privacy & performance
ydata-profiling (formerly pandas-profiling)
# In a local environment (not in-browser):
# pip install ydata-profiling
from ydata_profiling import ProfileReport

profile = ProfileReport(
    df, title="EDA Profile",
    explorative=True,  # extra correlations, warnings, duplicates…
    minimal=False
)
profile.to_file("profile.html")   # open this HTML report in your browser
Tips: sample large datasets first (df.sample(50000, random_state=0)), remove PII columns, and consider minimal=True for speed.
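
Combining those tips into the fast path (the sample size and output file name are arbitrary choices here):

sample = df.sample(min(len(df), 50_000), random_state=0)
ProfileReport(sample, title="EDA Profile (sampled)", minimal=True).to_file("profile_minimal.html")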
Sweetviz (beautiful exploratory reports)
# pip install sweetviz
import sweetviz as sv
report = sv.analyze(df)
report.show_html("sweetviz_report.html")

Quiz: Automated Profilers

10. Good cautions when generating auto reports (select all):

Final Quiz & Summary

Review your score and identify sections to revisit. You can reset locally and try again.


Try it yourself (runs in your browser)

Pyodide will load Python & pandas on first run. Edit and run.

import pandas as pd
import numpy as np

# Toy dataset with types, missingness, categories
rng = np.random.default_rng(7)
n = 24
df = pd.DataFrame({
    "order_id": [chr(65+i) for i in range(n)],
    "amount": rng.normal(120, 30, n).round(2),
    "discount": rng.choice([0, 5, 10, np.nan], size=n, p=[.5,.2,.2,.1]),
    "city": rng.choice(["Austin","Boston","Chicago","Denver"], size=n, p=[.3,.25,.3,.15]),
    "signup": pd.to_datetime("2024-02-01") + pd.to_timedelta(rng.integers(0, 20, n), unit="D"),
    "vip": rng.choice([True, False], size=n, p=[.3,.7]),
})
# Inject a few nulls and dup row
df.loc[5, "amount"] = np.nan
df.loc[10, "city"] = None
df = pd.concat([df, df.iloc[[3]]], ignore_index=True)  # single duplicate row

print("=== HEAD ===")
print(df.head(), end="\n\n")

print("=== INFO ===")
df.info()
print()

print("=== DESCRIBE (numeric) ===")
print(df.describe(), end="\n\n")

print("=== DESCRIBE (cats/bools) ===")
print(df.describe(include=["object","bool","category"]), end="\n\n")

print("=== MISSINGNESS (column %) ===")
print((df.isna().mean()*100).round(2).sort_values(ascending=False), end="\n\n")

print("=== DUPLICATES ===")
print("duplicate rows:", int(df.duplicated().sum()))
print("order_id unique?", df["order_id"].is_unique, end="\\n\\n")

print("=== CORRELATIONS (Pearson) ===")
print(df.corr(numeric_only=True))