Turn Thousands of Messy JSON Files into One Parquet: DuckDB for Fast Data Warehouse Ingestion
If you work with API data long enough you eventually inherit a directory or bucket full of tiny JSON files—one per product, event batch, log minute, user snapshot—because “it was easy at the time.” That convenience tax shows up later as slow scans, schema anxiety, and rising warehouse bills. DuckDB lets you pay that tax once: consolidate early, then model against something stable and polite.
Why Tiny JSON Files Hurt Your Pipeline
APIs commonly emit one JSON per entity or per interval, produce uneven batches, introduce or drop nested fields over time, and repeat key names excessively. On object storage you then pay a latency and request tax just to list and fetch thousands of small objects before you even parse them. Warehouses must then repeatedly decode untyped text. Schema drift further forces defensive SQL, and “quick reprocessing” means walking the file tree again. Instead of pushing that complexity downstream, you can touch your raw files once, normalize early and expose a stable surface for analytics and modeling 🤘.
The High-Level Pattern
- Land raw JSON files (treat them as immutable evidence you can always revisit).
- Use DuckDB to read them via a glob, inferring or coercing a consistent schema.
- Emit one (or a controlled set of) Parquet artifact(s) with lineage metadata.
- Load or reference those Parquet files from your warehouse or transform directly in DuckDB/dbt.
- Iterate safely without repeatedly touching the original file explosion.
Minimal implementation:
import duckdb, pathlib, uuid

def json_dir_to_parquet(source_dir, target_dir, db=":memory:", ext=".json"):
    """
    Consolidate many JSON files into a single Parquet file.
    Retains the original filename (lineage) via DuckDB's filename option.
    """
    conn = duckdb.connect(db)  # Point to a file path for persistence if needed
    pathlib.Path(target_dir).mkdir(parents=True, exist_ok=True)
    out_file = f"{target_dir}/{uuid.uuid4()}.parquet"
    conn.execute(f"""
        COPY (
            SELECT *
            FROM read_json_auto('{source_dir}/*{ext}', filename=true)
        )
        TO '{out_file}' (FORMAT 'parquet');
    """)  # filename=true appends a filename column, useful for referencing and debugging
    return out_file
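Usage is a single call; the directory names below are placeholders:

staged = json_dir_to_parquet("raw/events", "staged/events")
print(staged)  # path of the consolidated Parquet file, ready for the next step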
If all you needed was “make the tiny files go away,” you can stop here. Curious about drift, loading paths, or edge cases? Read on.
Mental Model of the Flow
This pattern fits best for anything small, frequent, and semantically similar. If the directory “feels” like a table but performance “feels” like waiting for a Databricks cluster to warm up during month-end reporting, you are the target audience. Here’s how I think of the flow in general:
- Raw JSON: many small, inconsistent files
- DuckDB local file scan
- Parquet: compressed, columnar, type‑stable (output to S3 or similar)
- Read from warehouse / dbt models
DuckDB is effective here because it reads JSON natively (no bespoke iterator loops), can push projections and filters early, writes high‑quality Parquet with statistics, and runs happily in a short‑lived local or ephemeral context.
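To make the early projection and filtering concrete, here is a minimal sketch; the column names and the `status != 'test'` predicate are hypothetical:

import duckdb

con = duckdb.connect()
# Sketch: project and filter during conversion so the Parquet output only
# contains what downstream models need. Column names here are illustrative.
con.execute("""
    COPY (
        SELECT id, created_at, status, filename AS source_file
        FROM read_json_auto('raw/*.json', filename=true)
        WHERE status != 'test'   -- filter before anything is written
    )
    TO 'staged/events.parquet' (FORMAT 'parquet');
""")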
Parquet vs Raw JSON (Why Bother?)
| Concern | JSON | Parquet |
|---|---|---|
| Compression | Repeated keys waste space | Columnar encoding + compression |
| Selectivity | All fields decoded | Column pruning + predicate pushdown |
| Schema | Implicit and drifting | Explicit and enforced |
| Scan cost | Higher per query | Lower thanks to stats & min/max |
| Type safety | Late surprises | Early, fail-fast casting |
| Merging | Manual juggling | Aligns by column name & type |
Performance & Cost Considerations
Reading fifteen thousand ≈5 KB JSON files directly from object storage is dominated by object listing round‑trips and many small request overheads; each query re‑parses verbose text. After consolidation you typically read a handful of contiguous Parquet files, prune non‑referenced columns, filter earlier, and scan fewer bytes—often cutting warehouse credits or execution time by half or more. CPU cycles per row also drop because vectorized decoding replaces repeated per‑row JSON parsing.
If the raw files live in object storage, it is often worth syncing them in bulk onto local SSD/NVMe (or a cache layer) before conversion; that alone frequently yields a 2–10× wall‑clock improvement for large small‑file batches compared to reading them file by file from e.g. S3.
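A minimal sketch of that staging step, assuming the AWS CLI is available and using placeholder bucket and directory names:

import subprocess, pathlib

def stage_and_convert(bucket_prefix, local_dir, target_dir):
    """Bulk-sync raw JSON from object storage to local disk, then convert once."""
    pathlib.Path(local_dir).mkdir(parents=True, exist_ok=True)
    # One bulk sync is far cheaper than per-file reads during the conversion query.
    subprocess.run(["aws", "s3", "sync", bucket_prefix, local_dir], check=True)
    return json_dir_to_parquet(local_dir, target_dir)  # defined earlier in this post

# stage_and_convert("s3://my-raw-bucket/events/", "scratch/events", "staged/events")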
A Clean, Repeatable Conversion Function
import duckdb, pathlib, uuid

def process_files(source_dir, target_dir, database_file=":memory:", file_extension=".json"):
    conn = duckdb.connect(database_file)
    pathlib.Path(target_dir).mkdir(parents=True, exist_ok=True)
    output_file_path = f"{target_dir}/{uuid.uuid4()}.parquet"
    conn.execute(f"""
        COPY (
            SELECT *
            FROM read_json_auto('{source_dir}/*{file_extension}', filename=true)
        )
        TO '{output_file_path}' (FORMAT 'parquet');
    """)  # filename=true already adds the lineage column to *
    return output_file_path
Refinements you can layer in: explicit casts, early filtering (“reject before you persist”), or array expansion for fact tables. Think of this step as squaring things away before the warehouse starts charging rent.
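Here is a hedged sketch of those refinements layered onto the same COPY pattern; the column names (`id`, `amount`, `items`) are hypothetical:

import duckdb

con = duckdb.connect()
# Sketch: cast early, reject obviously bad rows, and expand a nested array into
# one row per element for a fact table. Column names are illustrative only.
con.execute("""
    COPY (
        SELECT
            CAST(id AS VARCHAR)                AS id,
            TRY_CAST(amount AS DECIMAL(12, 2)) AS amount,
            UNNEST(items)                      AS item,        -- one row per array element
            filename                           AS source_file
        FROM read_json_auto('raw/orders/*.json', filename=true)
        WHERE id IS NOT NULL                                   -- reject before you persist
    )
    TO 'staged/order_items.parquet' (FORMAT 'parquet');
""")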
Loading Into a Warehouse (General Approaches)
You can
- push data through DuckDB’s connectors (e.g. Postgres, MotherDuck),
- land Parquet in object storage and invoke native warehouse COPY/LOAD, or
- query Parquet directly with the DuckDB dbt adapter, then publish curated models to your primary warehouse.
Here’s an example of attaching a Postgres database through DuckDB:
-- INSTALL only needs to run once per environment
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=analytics host=localhost user=me password=secret' AS pg (TYPE POSTGRES);
CREATE TABLE pg.public.events AS
SELECT * FROM read_parquet('output/xyz.parquet');
For Snowflake, BigQuery, or Redshift you would stage the Parquet and run their COPY / LOAD statements via orchestration.
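As a sketch of the staging half (assuming boto3 and placeholder bucket/prefix names); the warehouse-specific COPY or LOAD would then be issued by your orchestrator:

import boto3

def stage_parquet(local_path, bucket="my-stage-bucket", prefix="staged/events"):
    """Upload the consolidated Parquet so the warehouse COPY/LOAD can pick it up."""
    key = f"{prefix}/{local_path.rsplit('/', 1)[-1]}"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"  # hand this URI to the warehouse COPY/LOAD step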
Using dbt with DuckDB External Sources
One pattern I like is referencing the Parquet files we’ve generated as external sources in dbt-duckdb, which lets a sub-directory act as a dynamic table name. In the example below the data is stored in files like `country-products/parsed/NL/random-file-name-xyz.parquet`, but we can query each country as its own table in our dbt models.
sources:
  - name: country_products
    meta:
      external_location: 's3://country-products/parsed/{name}/*.parquet'
    tables:
      - name: NL
      - name: DE
      - name: UK
select * from {{ source('country_products', 'NL') }}
Handling Schema Drift and Edge Cases
Treat drift as a fact of life, not a failure of planning. You can decide how to handle it: absorb, coerce, extract, or isolate. Absorbing differing batches is as simple as a schema-aligned union:
SELECT * FROM read_json_auto('dir/a/*.json')
UNION BY NAME
SELECT * FROM read_json_auto('dir/b/*.json');
Selective extraction:
SELECT
    json_extract_string(json, '$.id') AS id,
    try_cast(json_extract(json, '$.metrics.score') AS DOUBLE) AS score,
    json_extract(json, '$.payload.attributes') AS attributes
FROM read_ndjson_objects('*.json');  -- exposes each record as a single json column
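The “isolate” option can be as simple as splitting rows that fail coercion into a quarantine file for later inspection; a sketch with hypothetical column names:

import duckdb

con = duckdb.connect()
# Sketch: rows with a usable numeric score go to the main file, the rest are
# quarantined for inspection. Column names are illustrative only.
con.execute("""
    CREATE TEMP TABLE parsed AS
    SELECT *, TRY_CAST(score AS DOUBLE) AS score_num
    FROM read_json_auto('raw/*.json', filename=true);
""")
con.execute("""
    COPY (SELECT * EXCLUDE (score_num) FROM parsed WHERE score_num IS NOT NULL)
    TO 'staged/clean.parquet' (FORMAT 'parquet');
""")
con.execute("""
    COPY (SELECT * EXCLUDE (score_num) FROM parsed WHERE score_num IS NULL)
    TO 'staged/quarantine.parquet' (FORMAT 'parquet');
""")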
When You Truly Need Manual Python Parsing
If you feel yourself writing loops for “normal” JSON ingestion, you may be fighting the engine. Loop only when unpredictable I/O or enrichment actually requires it.
import duckdb, pathlib, uuid, json

def resilient_process(source_dir, target_dir, db=":memory:", ext=".json"):
    conn = duckdb.connect(db)
    table = f"t_{uuid.uuid4().hex}"
    conn.execute(f"""
        CREATE TABLE {table} (
            id TEXT PRIMARY KEY,
            data JSON,
            source_file TEXT
        )
    """)  # Of course you can use any schema that fits your needs
    files = [p for p in pathlib.Path(source_dir).iterdir()
             if p.suffix == ext and not p.name.startswith('.')]
    for idx, path in enumerate(files):
        if idx % 250 == 0:
            print(f"[{idx}/{len(files)}] {path.name}")
        try:
            with path.open() as f:
                raw = json.load(f)
            row_id = raw.get("id") or path.stem
            conn.execute(
                f"INSERT OR IGNORE INTO {table} VALUES (?, ?, ?)",
                [row_id, json.dumps(raw), path.name]
            )
        except Exception as e:
            print(f"Skip {path.name}: {e}")
    out = f"{target_dir}/{uuid.uuid4()}.parquet"
    pathlib.Path(target_dir).mkdir(parents=True, exist_ok=True)
    conn.execute(f"COPY {table} TO '{out}' (FORMAT 'parquet');")
    return out
Recommended Operational Practices
Do:
- Retain raw JSON in cold storage; serve Parquet as the hot analytical layer. My motto is: only touch it once.
- Add `ingested_at`, `source_file`, and a batch/run identifier.
- Partition logically (e.g. by extraction date or region).
- Maintain a manifest (file_count, byte_size, row_count, schema hash).
- Diff schema across runs in CI to catch silent drift early.
Avoid:
- Combining unrelated domains in a single Parquet output.
- Casting every field to TEXT “just in case.”
- Producing thousands of micro‑Parquet shards (undoes consolidation benefits).
If you want to extend this further, you can add incremental processing (track seen filenames in a lightweight manifest), automated schema diff alerts, a small statistics table for freshness dashboards, or simple row count variance checks (previous vs current batch) as a basic quality gate; a sketch follows below.
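Here is a minimal sketch of such a manifest covering the row-count and schema-diff pieces, kept in a persistent DuckDB file; the table and column names are my own choices rather than a fixed convention:

import duckdb, hashlib

def update_manifest(parquet_path, manifest_db="manifest.duckdb"):
    """Record per-batch stats so later runs can compare row counts and flag drift."""
    con = duckdb.connect(manifest_db)  # persistent database file
    con.execute("""
        CREATE TABLE IF NOT EXISTS batches (
            parquet_file TEXT,
            row_count BIGINT,
            schema_hash TEXT,
            ingested_at TIMESTAMP DEFAULT current_timestamp
        )
    """)
    row_count = con.execute(
        f"SELECT count(*) FROM read_parquet('{parquet_path}')"
    ).fetchone()[0]
    # Hash the column name/type pairs; a change between runs signals schema drift.
    cols = con.execute(
        f"DESCRIBE SELECT * FROM read_parquet('{parquet_path}')"
    ).fetchall()
    schema_hash = hashlib.md5(
        ",".join(f"{name}:{dtype}" for name, dtype, *_ in cols).encode()
    ).hexdigest()
    con.execute(
        "INSERT INTO batches (parquet_file, row_count, schema_hash) VALUES (?, ?, ?)",
        [parquet_path, row_count, schema_hash]
    )
    return row_count, schema_hash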
Summary
A tiny DuckDB step lets you swap chaos (N files) for clarity (one or a few Parquet artifacts). Start with `read_json_auto`; escalate only when the data misbehaves. The return on consolidation compounds with every downstream query you don’t have to overpay for.