Parquet & Feather
Columnar formats for fast analytics and data exchange. Learn how to read/write with pandas and pyarrow, pick codecs, and organize partitioned datasets.
Fundamentals #
Key ideas
- Columnar storage ⇒ read only needed columns, excellent compression for homogeneous data.
- Parquet: columnar, compressed, typed schema, great for analytics & partitioned datasets.
- Feather (Arrow IPC file): very fast local read/write & interchange; lighter metadata than Parquet.
# Column pruning example (conceptual)
import pandas as pd
# Read only the two needed columns from a wide table:
df = pd.read_parquet("events.parquet", columns=["user_id","ts"])
“Move fewer bytes. Columnar I/O = speed + smaller bills.”
Quiz: Fundamentals
1. Main advantage of columnar formats like Parquet?
2. Parquet vs Feather: the best general guideline is…
pandas I/O #
Read/write with pandas
- df.to_parquet(...) / pd.read_parquet(...) (engine: pyarrow or fastparquet).
- df.to_feather(...) / pd.read_feather(...) (requires pyarrow).
- Select columns on read with columns=[...]; write without the index via index=False.
import pandas as pd
df = pd.DataFrame({
"user_id":[1,2,3],
"country":["PL","DE","US"],
"amount":[12.5, 9.0, 7.2]
})
# Parquet
df.to_parquet("data.parquet", engine="pyarrow", compression="snappy", index=False)
subset = pd.read_parquet("data.parquet", columns=["user_id","amount"])
# Feather (Arrow IPC)
df.to_feather("data.feather") # fast local format
df2 = pd.read_feather("data.feather", columns=["user_id"])
Quiz: pandas I/O
3. Which call correctly writes a Parquet file with compression?
df = pd.DataFrame({"a":[1,2], "b":[3,4]})
4. Load only two columns from a wide Parquet file:
pyarrow Dataset & Partitioning #
Partitioned datasets
- Organize as root/col=value/col2=value2/part-*.parquet (Hive-style) for scalable reads.
- Use predicate pushdown to read only matching partitions/row groups.
# pyarrow example (requires pyarrow installed)
import pyarrow as pa, pyarrow.dataset as ds
table = pa.table({"year":[2024,2024,2025], "month":[7,7,1], "value":[1,2,3]})
ds.write_dataset(
table, base_dir="out_ds", format="parquet",
partitioning=["year","month"], partitioning_flavor="hive",
existing_data_behavior="overwrite_or_ignore"
)
dataset = ds.dataset("out_ds", format="parquet", partitioning="hive")
# Read only 2024/07 partition and 'value' column
t = dataset.to_table(columns=["value"], filter=(ds.field("year")==2024) & (ds.field("month")==7))
Quiz: Dataset
5. A scalable layout for time-partitioned Parquet is…
6. What does “predicate pushdown” give you when reading Parquet?
Schema & Types #
Type handling
- Parquet/Feather store a typed schema (including nullability).
- Arrow types (e.g., dictionary/categorical, timestamp with timezone) map to pandas dtypes via the engine.
- Always validate dtypes on read; some backends may coerce to a nearest pandas type.
# pyarrow schema peek (if available)
import pyarrow as pa, pyarrow.parquet as pq
schema = pq.read_schema("data.parquet")
print(schema)
Quiz: Schema
7. True about Parquet/Feather schemas:
8. Common Parquet compression codecs (select all):
Compression & Performance #
Best practices
- Read only required columns (columns=[...]); use partition pruning & filters.
- Avoid millions of tiny files; prefer moderately sized row groups/files.
- Feather is great for quick local caches; Parquet for data lakes & analytics.
# pandas read with column pruning
import pandas as pd
df = pd.read_parquet("events.parquet", columns=["user_id","event_type"])
Quiz: Performance
9. Which is a recommended practice for Parquet datasets?
10. Can you efficiently append rows in place to Parquet/Feather files?
Final Quiz & Summary #
Review your performance and revisit questions you missed.
Category: python · Lesson: parquet-feather
Learning by Examples
This environment may not have pyarrow. The snippet detects availability and falls back with guidance.
# Parquet & Feather playground (detects pandas/pyarrow)
import importlib.util
import textwrap

def have(mod):
    return importlib.util.find_spec(mod) is not None

print("pandas available:", have("pandas"))
print("pyarrow available:", have("pyarrow"))

if not have("pandas"):
    print("\nThis demo needs pandas. Try on your local machine with:")
    print("  pip install pandas pyarrow")
else:
    import pandas as pd
    df = pd.DataFrame({
        "user_id": [1, 2, 3, 4, 5],
        "country": ["PL", "DE", "US", "PL", "US"],
        "amount": [12.5, 9.0, 7.2, 5.1, 99.9]
    })
    print("\nDataFrame:")
    print(df.head())
    if have("pyarrow"):
        # Write Parquet and Feather
        df.to_parquet("demo.parquet", engine="pyarrow", compression="snappy", index=False)
        df.to_feather("demo.feather")  # Arrow IPC file
        # Column pruning read
        sub = pd.read_parquet("demo.parquet", columns=["user_id", "amount"])
        print("\nRead columns [user_id, amount] from Parquet:")
        print(sub)
        # Quick schema peek using pyarrow
        import pyarrow.parquet as pq
        schema = pq.read_schema("demo.parquet")
        print("\nParquet schema:")
        print(schema)
        print("\nFiles written: demo.parquet, demo.feather")
    else:
        print(textwrap.dedent("""\
            \npyarrow not found — cannot write Parquet/Feather here.
            Run locally:
                pip install pandas pyarrow
            Then:
                df.to_parquet("demo.parquet", engine="pyarrow", compression="snappy", index=False)
                df.to_feather("demo.feather")
            """))