Turn Thousands of Messy JSON Files into One Parquet: DuckDB for Fast Data Warehouse Ingestion
If you work with API data long enough you eventually inherit a directory or bucket full of tiny JSON files—one per product, event batch, log minute, user snapshot—because “it was easy at the time.” That convenience tax shows up later as slow scans, schema anxiety, and rising warehouse bills. DuckDB lets you pay that tax once: consolidate early, then model against something stable and polite.
Why Tiny JSON Files Hurt Your Pipeline
APIs commonly emit one JSON per entity or per interval, produce uneven batches, introduce or drop nested fields over time, and repeat key names excessively. On object storage you then pay a latency and request tax just to list and fetch thousands of small objects before you even parse them. Warehouses must then repeatedly decode untyped text. Schema drift further forces defensive SQL, and “quick reprocessing” means walking the file tree again. Instead of pushing that complexity downstream, you can touch your raw files once, normalize early and expose a stable surface for analytics and modeling 🤘.
The High-Level Pattern
- Land raw JSON files (treat them as immutable evidence you can always revisit).
- Use DuckDB to read them via a glob, inferring or coercing a consistent schema.
- Emit one (or a controlled set of) Parquet artifact(s) with lineage metadata.
- Load or reference those Parquet files from your warehouse or transform directly in DuckDB/dbt.
- Iterate safely without repeatedly touching the original file explosion.
Minimal implementation:
import duckdb, pathlib, uuid

def json_dir_to_parquet(source_dir, target_dir, db=":memory:", ext=".json"):
    """
    Consolidate many JSON files into a single Parquet file.
    Retains the original filename (lineage) via DuckDB's filename option.
    """
    conn = duckdb.connect(db)  # Point to a file path for persistence if needed
    pathlib.Path(target_dir).mkdir(parents=True, exist_ok=True)
    out_file = f"{target_dir}/{uuid.uuid4()}.parquet"
    conn.execute(f"""
        COPY (
            SELECT *
            FROM read_json_auto('{source_dir}/*{ext}', filename=true)
        )
        TO '{out_file}' (FORMAT 'parquet');
    """)  # filename=true appends a filename column, useful for referencing and debugging
    return out_file
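Usage is a single call; the directory names below are placeholders:

staged = json_dir_to_parquet("raw/events", "staged/events")
print(staged)  # path of the consolidated Parquet file, ready for the next step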
If all you needed was “make the tiny files go away,” you can stop here. Curious about drift, loading paths, or edge cases? Read on.
Mental Model of the Flow
This pattern fits best for anything small, frequent, and semantically similar. If the directory “feels” like a table but performance “feels” like waiting for a Databricks cluster to warm up during month-end reporting, you are the target audience. Here’s how I think of the flow in general:
- Raw JSON: many small, inconsistent files
- DuckDB local file scan
- Parquet: compressed, columnar, type‑stable (output to S3 or similar)
- Read from warehouse / dbt models
DuckDB is effective here because it reads JSON natively (no bespoke iterator loops), can push projections and filters early, writes high‑quality Parquet with statistics, and runs happily in a short‑lived local or ephemeral context.
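To make the early projection and filtering concrete, here is a minimal sketch; the column names and the `status != 'test'` predicate are hypothetical:

import duckdb

con = duckdb.connect()
# Sketch: project and filter during conversion so the Parquet output only
# contains what downstream models need. Column names here are illustrative.
con.execute("""
    COPY (
        SELECT id, created_at, status, filename AS source_file
        FROM read_json_auto('raw/*.json', filename=true)
        WHERE status != 'test'   -- filter before anything is written
    )
    TO 'staged/events.parquet' (FORMAT 'parquet');
""")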
Parquet vs Raw JSON (Why Bother?)
| Concern | JSON | Parquet |
|---|---|---|
| Compression | Repeated keys waste space | Columnar encoding + compression |
| Selectivity | All fields decoded | Column pruning + predicate pushdown |
| Schema | Implicit and drifting | Explicit and enforced |
| Scan cost | Higher per query | Lower thanks to stats & min/max |
| Type safety | Late surprises | Early, fail-fast casting |
| Merging | Manual juggling | Aligns by column name & type |
Performance & Cost Considerations
Reading fifteen thousand ≈5 KB JSON files directly from object storage is dominated by object listing round‑trips and many small request overheads; each query re‑parses verbose text. After consolidation you typically read a handful of contiguous Parquet files, prune non‑referenced columns, filter earlier, and scan fewer bytes—often cutting warehouse credits or execution time by half or more. CPU cycles per row also drop because vectorized decoding replaces repeated per‑row JSON parsing.
If the raw files live in object storage, it is often worth syncing them in bulk onto local SSD/NVMe (or a cache layer) before conversion; that alone frequently yields a 2–10× wall‑clock improvement for large small‑file batches compared to reading them file by file from e.g. S3.
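A minimal sketch of that staging step, assuming the AWS CLI is available and using placeholder bucket and directory names:

import subprocess, pathlib

def stage_and_convert(bucket_prefix, local_dir, target_dir):
    """Bulk-sync raw JSON from object storage to local disk, then convert once."""
    pathlib.Path(local_dir).mkdir(parents=True, exist_ok=True)
    # One bulk sync is far cheaper than per-file reads during the conversion query.
    subprocess.run(["aws", "s3", "sync", bucket_prefix, local_dir], check=True)
    return json_dir_to_parquet(local_dir, target_dir)  # defined earlier in this post

# stage_and_convert("s3://my-raw-bucket/events/", "scratch/events", "staged/events")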
A Clean, Repeatable Conversion Function
import duckdb, pathlib, uuid

def process_files(source_dir, target_dir, database_file=":memory:", file_extension=".json"):
    conn = duckdb.connect(database_file)
    pathlib.Path(target_dir).mkdir(parents=True, exist_ok=True)
    output_file_path = f"{target_dir}/{uuid.uuid4()}.parquet"
    conn.execute(f"""
        COPY (
            SELECT *
            FROM read_json_auto('{source_dir}/*{file_extension}', filename=true)
        )
        TO '{output_file_path}' (FORMAT 'parquet');
    """)  # filename=true already adds the lineage column to *
    return output_file_path
Refinements you can layer in: explicit casts, early filtering (“reject before you persist”), or array expansion for fact tables. Think of this step as squaring things away before the warehouse starts charging rent.
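Here is a hedged sketch of those refinements layered onto the same COPY pattern; the column names (`id`, `amount`, `items`) are hypothetical:

import duckdb

con = duckdb.connect()
# Sketch: cast early, reject obviously bad rows, and expand a nested array into
# one row per element for a fact table. Column names are illustrative only.
con.execute("""
    COPY (
        SELECT
            CAST(id AS VARCHAR)                AS id,
            TRY_CAST(amount AS DECIMAL(12, 2)) AS amount,
            UNNEST(items)                      AS item,        -- one row per array element
            filename                           AS source_file
        FROM read_json_auto('raw/orders/*.json', filename=true)
        WHERE id IS NOT NULL                                   -- reject before you persist
    )
    TO 'staged/order_items.parquet' (FORMAT 'parquet');
""")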
Loading Into a Warehouse (General Approaches)
You can
- push data through DuckDB’s connectors (e.g. Postgres, MotherDuck),
- land Parquet in object storage and invoke native warehouse COPY/LOAD, or
- query Parquet directly with the DuckDB dbt adapter, then publish curated models to your primary warehouse.
Here’s an example of attaching a Postgres database through DuckDB:
-- INSTALL only needs to run once per environment
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=analytics host=localhost user=me password=secret' AS pg (TYPE POSTGRES);
CREATE TABLE pg.public.events AS
SELECT * FROM read_parquet('output/xyz.parquet');
For Snowflake, BigQuery, or Redshift you would stage the Parquet and run their COPY / LOAD statements via orchestration.
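As a sketch of the staging half (assuming boto3 and placeholder bucket/prefix names); the warehouse-specific COPY or LOAD would then be issued by your orchestrator:

import boto3

def stage_parquet(local_path, bucket="my-stage-bucket", prefix="staged/events"):
    """Upload the consolidated Parquet so the warehouse COPY/LOAD can pick it up."""
    key = f"{prefix}/{local_path.rsplit('/', 1)[-1]}"
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"  # hand this URI to the warehouse COPY/LOAD step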
Using dbt with DuckDB External Sources
One pattern I like is referencing the Parquet files we’ve generated as external sources in dbt-duckdb, which lets a sub-directory act as a dynamic table name. In the example below the data is stored in files like `country-products/parsed/NL/random-file-name-xyz.parquet`, but we can query each country as its own table in our dbt models.
sources:
  - name: country_products
    meta:
      external_location: 's3://country-products/parsed/{name}/*.parquet'
    tables:
      - name: NL
      - name: DE
      - name: UK
select * from {{ source('country_products', 'NL') }}
Handling Schema Drift and Edge Cases
Treat drift as a fact of life, not a failure of planning. You can decide how to handle it: absorb, coerce, extract, or isolate. Absorbing differing batches is as simple as a schema-aligned union:
SELECT * FROM read_json_auto('dir/a/*.json')
UNION BY NAME
SELECT * FROM read_json_auto('dir/b/*.json');
Selective extraction:
SELECT
    json_extract_string(json, '$.id') AS id,
    try_cast(json_extract(json, '$.metrics.score') AS DOUBLE) AS score,
    json_extract(json, '$.payload.attributes') AS attributes
FROM read_ndjson_objects('*.json');  -- exposes each record as a single json column
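The “isolate” option can be as simple as splitting rows that fail coercion into a quarantine file for later inspection; a sketch with hypothetical column names:

import duckdb

con = duckdb.connect()
# Sketch: rows with a usable numeric score go to the main file, the rest are
# quarantined for inspection. Column names are illustrative only.
con.execute("""
    CREATE TEMP TABLE parsed AS
    SELECT *, TRY_CAST(score AS DOUBLE) AS score_num
    FROM read_json_auto('raw/*.json', filename=true);
""")
con.execute("""
    COPY (SELECT * EXCLUDE (score_num) FROM parsed WHERE score_num IS NOT NULL)
    TO 'staged/clean.parquet' (FORMAT 'parquet');
""")
con.execute("""
    COPY (SELECT * EXCLUDE (score_num) FROM parsed WHERE score_num IS NULL)
    TO 'staged/quarantine.parquet' (FORMAT 'parquet');
""")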
When You Truly Need Manual Python Parsing
If you feel yourself writing loops for “normal” JSON ingestion, you may be fighting the engine. Loop only when unpredictable I/O or enrichment actually requires it.
import duckdb, pathlib, uuid, json

def resilient_process(source_dir, target_dir, db=":memory:", ext=".json"):
    conn = duckdb.connect(db)
    table = f"t_{uuid.uuid4().hex}"
    conn.execute(f"""
        CREATE TABLE {table} (
            id TEXT PRIMARY KEY,
            data JSON,
            source_file TEXT
        )
    """)  # Of course you can use any schema that fits your needs
    files = [p for p in pathlib.Path(source_dir).iterdir()
             if p.suffix == ext and not p.name.startswith('.')]
    for idx, path in enumerate(files):
        if idx % 250 == 0:
            print(f"[{idx}/{len(files)}] {path.name}")
        try:
            with path.open() as f:
                raw = json.load(f)
            row_id = raw.get("id") or path.stem
            conn.execute(
                f"INSERT OR IGNORE INTO {table} VALUES (?, ?, ?)",
                [row_id, json.dumps(raw), path.name]
            )
        except Exception as e:
            print(f"Skip {path.name}: {e}")
    out = f"{target_dir}/{uuid.uuid4()}.parquet"
    pathlib.Path(target_dir).mkdir(parents=True, exist_ok=True)
    conn.execute(f"COPY {table} TO '{out}' (FORMAT 'parquet');")
    return out
Recommended Operational Practices
Do:
- Retain raw JSON in cold storage; serve Parquet as the hot analytical layer. My motto is: only touch it once.
- Add `ingested_at`, `source_file`, and a batch/run identifier.
- Partition logically (e.g. by extraction date or region).
- Maintain a manifest (file_count, byte_size, row_count, schema hash).
- Diff schema across runs in CI to catch silent drift early.
Avoid:
- Combining unrelated domains in a single Parquet output.
- Casting every field to TEXT “just in case.”
- Producing thousands of micro‑Parquet shards (undoes consolidation benefits).
If you want to extend this further, you can add incremental processing (track seen filenames in a lightweight manifest), automated schema diff alerts, a small statistics table for freshness dashboards, or simple row count variance checks (previous vs current batch) as a basic quality gate; a sketch follows below.
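Here is a minimal sketch of such a manifest covering the row-count and schema-diff pieces, kept in a persistent DuckDB file; the table and column names are my own choices rather than a fixed convention:

import duckdb, hashlib

def update_manifest(parquet_path, manifest_db="manifest.duckdb"):
    """Record per-batch stats so later runs can compare row counts and flag drift."""
    con = duckdb.connect(manifest_db)  # persistent database file
    con.execute("""
        CREATE TABLE IF NOT EXISTS batches (
            parquet_file TEXT,
            row_count BIGINT,
            schema_hash TEXT,
            ingested_at TIMESTAMP DEFAULT current_timestamp
        )
    """)
    row_count = con.execute(
        f"SELECT count(*) FROM read_parquet('{parquet_path}')"
    ).fetchone()[0]
    # Hash the column name/type pairs; a change between runs signals schema drift.
    cols = con.execute(
        f"DESCRIBE SELECT * FROM read_parquet('{parquet_path}')"
    ).fetchall()
    schema_hash = hashlib.md5(
        ",".join(f"{name}:{dtype}" for name, dtype, *_ in cols).encode()
    ).hexdigest()
    con.execute(
        "INSERT INTO batches (parquet_file, row_count, schema_hash) VALUES (?, ?, ?)",
        [parquet_path, row_count, schema_hash]
    )
    return row_count, schema_hash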
Summary
A tiny DuckDB step lets you swap chaos (N files) for clarity (one or a few Parquet artifacts). Start with `read_json_auto`; escalate only when the data misbehaves. The return on consolidation compounds with every downstream query you don’t have to overpay for.