Parquet & Feather

Columnar formats for fast analytics and data exchange. Learn how to read/write with pandas and pyarrow, pick codecs, and organize partitioned datasets.


Fundamentals #

Key ideas
  • Columnar storage ⇒ read only needed columns, excellent compression for homogeneous data.
  • Parquet: columnar, compressed, typed schema, great for analytics & partitioned datasets.
  • Feather (Arrow IPC file): very fast local read/write & interchange; lighter metadata than Parquet.
# Column pruning example (conceptual)
import pandas as pd
# Read only the two needed columns from a wide table:
df = pd.read_parquet("events.parquet", columns=["user_id","ts"])
“Move fewer bytes. Columnar I/O = speed + smaller bills.”
— Data engineering proverb

Quiz: Fundamentals

1. Main advantage of columnar formats like Parquet?
2. Parquet vs Feather: the best general guideline is…

pandas I/O #

Read/write with pandas
  • df.to_parquet(...) / pd.read_parquet(...) (engine: pyarrow or fastparquet).
  • df.to_feather(...) / pd.read_feather(...) (requires pyarrow).
  • Select columns on read with columns=[...]; write without index via index=False.
import pandas as pd

df = pd.DataFrame({
    "user_id":[1,2,3],
    "country":["PL","DE","US"],
    "amount":[12.5, 9.0, 7.2]
})

# Parquet
df.to_parquet("data.parquet", engine="pyarrow", compression="snappy", index=False)
subset = pd.read_parquet("data.parquet", columns=["user_id","amount"])

# Feather (Arrow IPC)
df.to_feather("data.feather")                       # fast local format
df2 = pd.read_feather("data.feather", columns=["user_id"])

Quiz: pandas I/O

3. Which call correctly writes a Parquet file with compression?
df = pd.DataFrame({"a":[1,2], "b":[3,4]})
4. Load only two columns from a wide Parquet file:

pyarrow Dataset & Partitioning #

Partitioned datasets
  • Organize as root/col=value/col2=value2/part-*.parquet (Hive-style) for scalable reads.
  • Use predicate pushdown to read only matching partitions/row groups.
# pyarrow example (requires pyarrow installed)
import pyarrow as pa, pyarrow.dataset as ds, pyarrow.parquet as pq

table = pa.table({"year":[2024,2024,2025], "month":[7,7,1], "value":[1,2,3]})
ds.write_dataset(
    table, base_dir="out_ds", format="parquet",
    partitioning=["year","month"], partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore"
)

dataset = ds.dataset("out_ds", format="parquet", partitioning="hive")
# Read only 2024/07 partition and 'value' column
t = dataset.to_table(columns=["value"], filter=(ds.field("year")==2024) & (ds.field("month")==7))

Quiz: Dataset

5. A scalable layout for time-partitioned Parquet is…
6. What does “predicate pushdown” give you when reading Parquet?

Schema & Types #

Type handling
  • Parquet/Feather store a typed schema (including nullability).
  • Arrow types (e.g., dictionary/categorical, timestamp with timezone) map to pandas dtypes via the engine.
  • Always validate dtypes on read; some backends may coerce values to the nearest available pandas type.
# pyarrow schema peek (if available)
import pyarrow as pa, pyarrow.parquet as pq
schema = pq.read_schema("data.parquet")
print(schema)

Quiz: Schema

7. True about Parquet/Feather schemas:
8. Common Parquet compression codecs (select all):

Compression & Performance #

Best practices
  • Read only required columns (columns=[...]); use partition pruning & filters.
  • Avoid millions of tiny files; prefer moderately sized row groups/files.
  • Feather is great for quick local caches; Parquet for data lakes & analytics.
# pandas read with column pruning
import pandas as pd
df = pd.read_parquet("events.parquet", columns=["user_id","event_type"])

Quiz: Performance

9. Which is a recommended practice for Parquet datasets?
10. Can you efficiently append rows in place to Parquet/Feather files?

Final Quiz & Summary #

Review your performance and revisit questions you missed.

Category: python · Lesson: parquet-feather
Learning by Examples
This environment may not have pyarrow. The snippet detects availability and falls back with guidance.
# Parquet & Feather playground (detects pandas/pyarrow)
import importlib.util, textwrap

def have(mod): 
    return importlib.util.find_spec(mod) is not None

print("pandas available:", have("pandas"))
print("pyarrow available:", have("pyarrow"))

if not have("pandas"):
    print("\nThis demo needs pandas. Try on your local machine with:")
    print("  pip install pandas pyarrow")
else:
    import pandas as pd
    df = pd.DataFrame({
        "user_id":[1,2,3,4,5],
        "country":["PL","DE","US","PL","US"],
        "amount":[12.5, 9.0, 7.2, 5.1, 99.9]
    })
    print("\nDataFrame:")
    print(df.head())

    if have("pyarrow"):
        # Write Parquet and Feather
        df.to_parquet("demo.parquet", engine="pyarrow", compression="snappy", index=False)
        df.to_feather("demo.feather")  # Arrow IPC file

        # Column pruning read
        sub = pd.read_parquet("demo.parquet", columns=["user_id","amount"])
        print("\nRead columns [user_id, amount] from Parquet:")
        print(sub)

        # Quick schema peek using pyarrow
        import pyarrow.parquet as pq
        schema = pq.read_schema("demo.parquet")
        print("\nParquet schema:")
        print(schema)

        print("\nFiles written: demo.parquet, demo.feather")
    else:
        print(textwrap.dedent("""\
            \npyarrow not found — cannot write Parquet/Feather here.
            Run locally:
              pip install pandas pyarrow
            Then:
              df.to_parquet("demo.parquet", engine="pyarrow", compression="snappy", index=False)
              df.to_feather("demo.feather")
        """))