Parquet & Feather

Columnar formats for fast analytics and data exchange. Learn how to read/write with pandas and pyarrow, pick codecs, and organize partitioned datasets.


Fundamentals #

Key ideas
  • Columnar storage ⇒ read only needed columns, excellent compression for homogeneous data.
  • Parquet: columnar, compressed, typed schema, great for analytics & partitioned datasets.
  • Feather (Arrow IPC file): very fast local read/write & interchange; lighter metadata than Parquet.
# Column pruning example (conceptual)
import pandas as pd
# Read only the two needed columns from a wide table:
df = pd.read_parquet("events.parquet", columns=["user_id","ts"])
“Move fewer bytes. Columnar I/O = speed + smaller bills.”
— Data engineering proverb

Quiz: Fundamentals

1. Main advantage of columnar formats like Parquet?
2. Parquet vs Feather: the best general guideline is…

pandas I/O #

Read/write with pandas
  • df.to_parquet(...) / pd.read_parquet(...) (engine: pyarrow or fastparquet).
  • df.to_feather(...) / pd.read_feather(...) (requires pyarrow).
  • Select columns on read with columns=[...]; write without index via index=False.
import pandas as pd

df = pd.DataFrame({
    "user_id":[1,2,3],
    "country":["PL","DE","US"],
    "amount":[12.5, 9.0, 7.2]
})

# Parquet
df.to_parquet("data.parquet", engine="pyarrow", compression="snappy", index=False)
subset = pd.read_parquet("data.parquet", columns=["user_id","amount"])

# Feather (Arrow IPC)
df.to_feather("data.feather")                       # fast local format
df2 = pd.read_feather("data.feather", columns=["user_id"])

Quiz: pandas I/O

3. Which call correctly writes a Parquet file with compression?
df = pd.DataFrame({"a":[1,2], "b":[3,4]})
4. Load only two columns from a wide Parquet file:

pyarrow Dataset & Partitioning #

Partitioned datasets
  • Organize as root/col=value/col2=value2/part-*.parquet (Hive-style) for scalable reads.
  • Use predicate pushdown to read only matching partitions/row groups.
# pyarrow example (requires pyarrow installed)
import pyarrow as pa, pyarrow.dataset as ds, pyarrow.parquet as pq

table = pa.table({"year":[2024,2024,2025], "month":[7,7,1], "value":[1,2,3]})
ds.write_dataset(
    table, base_dir="out_ds", format="parquet",
    partitioning=["year","month"], partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore"
)

dataset = ds.dataset("out_ds", format="parquet", partitioning="hive")
# Read only 2024/07 partition and 'value' column
t = dataset.to_table(columns=["value"], filter=(ds.field("year")==2024) & (ds.field("month")==7))

Quiz: Dataset

5. A scalable layout for time-partitioned Parquet is…
6. What does “predicate pushdown” give you when reading Parquet?

Schema & Types #

Type handling
  • Parquet/Feather store a typed schema (including nullability).
  • Arrow types (e.g., dictionary/categorical, timestamp with timezone) map to pandas dtypes via the engine.
  • Always validate dtypes on read; some backends may coerce values to the nearest available pandas type.
# pyarrow schema peek (if available)
import pyarrow as pa, pyarrow.parquet as pq
schema = pq.read_schema("data.parquet")
print(schema)

Quiz: Schema

7. True about Parquet/Feather schemas:
8. Common Parquet compression codecs (select all):

Compression & Performance #

Best practices
  • Read only required columns (columns=[...]); use partition pruning & filters.
  • Avoid millions of tiny files; prefer moderately sized row groups/files.
  • Feather is great for quick local caches; Parquet for data lakes & analytics.
# pandas read with column pruning
import pandas as pd
df = pd.read_parquet("events.parquet", columns=["user_id","event_type"])

Quiz: Performance

9. Which is a recommended practice for Parquet datasets?
10. Can you efficiently append rows in place to Parquet/Feather files?

Final Quiz & Summary #

Review your performance and revisit questions you missed.

Category: python · Lesson: parquet-feather
Learning by Examples
This environment may not have pyarrow. The snippet detects availability and falls back with guidance.
# Parquet & Feather playground (detects pandas/pyarrow)
import importlib.util, textwrap

def have(mod): 
    return importlib.util.find_spec(mod) is not None

print("pandas available:", have("pandas"))
print("pyarrow available:", have("pyarrow"))

if not have("pandas"):
    print("\nThis demo needs pandas. Try on your local machine with:")
    print("  pip install pandas pyarrow")
else:
    import pandas as pd
    df = pd.DataFrame({
        "user_id":[1,2,3,4,5],
        "country":["PL","DE","US","PL","US"],
        "amount":[12.5, 9.0, 7.2, 5.1, 99.9]
    })
    print("\nDataFrame:")
    print(df.head())

    if have("pyarrow"):
        # Write Parquet and Feather
        df.to_parquet("demo.parquet", engine="pyarrow", compression="snappy", index=False)
        df.to_feather("demo.feather")  # Arrow IPC file

        # Column pruning read
        sub = pd.read_parquet("demo.parquet", columns=["user_id","amount"])
        print("\nRead columns [user_id, amount] from Parquet:")
        print(sub)

        # Quick schema peek using pyarrow
        import pyarrow.parquet as pq
        schema = pq.read_schema("demo.parquet")
        print("\nParquet schema:")
        print(schema)

        print("\nFiles written: demo.parquet, demo.feather")
    else:
        print(textwrap.dedent("""\
            \npyarrow not found — cannot write Parquet/Feather here.
            Run locally:
              pip install pandas pyarrow
            Then:
              df.to_parquet("demo.parquet", engine="pyarrow", compression="snappy", index=False)
              df.to_feather("demo.feather")
        """))