I’ll admit it, as a data engineer I still use Excel. It’s great for exploring unknown datasets or CSVs from stakeholders. There is something just so convenient about seeing data in a tabular format in a tool that everyone has, knows and uses. I’m here to tell you that you can 10x your Excel powers without even using AI. As much as I like the tabular format, the downsides are also very real:

  • My CSVs are always differently formatted, so I struggle with turning them from text into rows
  • If I have multiple CSVs of the same data (e.g. multiple months or years) there’s a lot of manual work in making the columns match
  • If there’s JSON or Parquet data involved, forget it. No Excel.
  • If the data is nested, it’s equally painful
  • Collaborating with others on the same dataset is a mess of v3_final_dumky_final.xlsx files.

With DuckDB we can tackle all of this with no programming required. If you can remember Excel function names, you can remember the SQL function names you’ll use in DuckDB.

Don’t believe me? Here’s how I query 1.3 million rows of tracker usage data spread across 8 CSV files from different countries.

[Screenshot: SQL code for processing tracker usage CSVs with DuckDB.]

If you’re curious, follow along and see what you can do. We will:

  1. Figure out where to best run DuckDB for you
  2. Learn some key commands to play around with your data
  3. Profit

How and where to run DuckDB

Just like Excel, DuckDB can be run in multiple ways. You can run it as an application on your computer, in a browser, or collaborate with multiple people in a cloud data warehouse using MotherDuck.

You can run it on your own machine by following the instructions on duckdb.org and then running duckdb -ui to get started. Or create a new database-as-file with duckdb my-playground.duckdb -ui.

Of course, you can use any application that connects to a database, like DBeaver or even Tableau, to query your data. But I think the notebooks of the DuckDB UI are a really nice way to explore data, just like you would in Excel.

TL;DR: If you don’t want to download DuckDB to your computer, just create a free account on Motherduck.com and run it in your browser.

Getting started: opening a CSV from anywhere

To get started working with data in DuckDB you don’t need more than a CSV file. It doesn’t even have to be a file on your computer as you can easily reference files on a website too. That of course is very convenient if you always want to make sure you have the latest version. Here is how you can access a CSV with metadata for all the freely available books on Project Gutenberg.

create or replace table pg_raw as (select * from read_csv('https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv'));
select * from pg_raw limit 3;

This will give you ~80K rows with title, author and subjects. It is important to also create the table and not just select from the CSV directly every time because otherwise you will download the same file over and over.

Issued:      1971-12-01
Title:       The Declaration of Independence of the United States of America
Language:    en
Authors:     Jefferson, Thomas, 1743-1826
Subjects:    United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence
LoCC:        E201; JK
Bookshelves: Politics; American Revolutionary War; United States Law; Category: Essays, Letters & Speeches; Category: History - American; Category: History - Modern (1750+); Category: Philosophy & Ethics; Category: Politics
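
By the way, two commands worth remembering whenever you load something new are describe and summarize. They give you a quick overview of the columns and what’s in them, a bit like scrolling through a fresh sheet in Excel:

-- list every column and its type
describe pg_raw;

-- per-column statistics: min, max, number of distinct values, null percentage
summarize pg_raw;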

I’m not sure if this happens to you, but I always find that I want to split data from a cell and analyze it at a different granularity. In this case I want to know what subjects are most common, but the subjects are separated by a -- symbol. In Excel I would have to go through a few different steps to split this and then pivot or aggregate the individual subjects. In DuckDB (or SQL in general) I can do it in three simple lines:

from pg_raw, unnest(split(Subjects, ' -- ')) as s
select s.unnest as subject, count(*)
group by 1 order by 2 desc

This will give me a summary of the most popular subjects:

Subject count
Fiction 16047
History 9333
Juvenile fiction 5576
19th century 4355
Periodicals 3324

Now, what is this sorcery, you might ask? Let’s go through these three lines to see what they do.

from pg_raw, unnest(split(Subjects, ' -- ')) as s

In the previous statement we created a table called pg_raw, which has a column named Subjects. We split this column on the separator --. Now, crucially, this gives us a list of subjects instead of a single line of text. Using unnest means that we go from one row per book to one row per item in the subject list. The , means that we cross join each book row with its own unnested list: the value of the unnested column changes for each item in the list, but all the other values of the row stay the same. So we basically get one full row per subject instead of one row per book.
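
If that’s hard to picture, here is what split and unnest do to a single value:

-- one string in, three rows out: one row per element of the split list
select unnest(split('Fiction -- History -- Periodicals', ' -- ')) as subject;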

DuckDB allows you to start with from <table_name> instead of the traditional select * from <table_name>. I prefer this approach because it allows you to auto-complete column names in the select part.
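
The two forms are interchangeable, so you can pick whichever you like:

-- the traditional way
select * from pg_raw limit 3;

-- the from-first way, same result
from pg_raw limit 3;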

Let’s take the second part in one go.

select s.unnest as subject, count(*)
group by 1 order by 2 desc

We first select the unnested column and rename it to subject. Because we group by this first column, we can count how many rows are present for each value of the subject column. Finally, we order by that count, highest first.

Next Level CSV-ing: combining multiple CSV files in one line

For your local files and files on cloud storage (S3, GCS), DuckDB allows you to use pattern matching. You can replace part of the path with a * to fetch all files that match the pattern. So if I have a folder with multiple years of sales data I can just use /sales/sales_for_year_*.csv to get all files in one go. If I also have a folder per country I can just do /sales/**/sales_for_year_*.csv to also look in every subfolder.
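
As a sketch (the paths here are made up, it’s the pattern that matters), loading that whole sales folder tree into one table would look something like this:

create or replace table sales as (
  -- * matches any year in the filename, ** matches every country subfolder
  select * from read_csv('/sales/**/sales_for_year_*.csv')
);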

The following example uses this pattern to fetch publicly available data on privacy and trackers from Ghostery’s WhoTracksMe project on Amazon S3.

create or replace table whotracksme as (
  from read_csv('s3://data.whotracks.me/20**-01/global/sites_trackers.csv', union_by_name=true, filename = true) 
  select *
)

This can take a while depending on your internet connection, as it will download a few large files with websites and trackers for January of each year, i.e. 2018-01 to 2025-01. There are two other things that are interesting to note here. First, we union by name, that is, we combine the files not by the position of each column (as would happen in Excel), but by its header name. This allows for changing columns and column positions over time, something that’s notoriously hard to solve in Excel. Secondly, by using filename = true we add the filename itself as a column. This allows us to use the name of the file as input in our data, for example when only the filename contains information about the date or year.

You can see which files are present in the table by running select distinct filename from whotracksme
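
And because each file sits in a folder named after its month (2018-01, 2019-01 and so on), you can pull the year straight back out of that filename column, for example with a regular expression:

-- one row per file, with the year extracted from the path
select distinct
  filename,
  regexp_extract(filename, '(20\d\d)-01', 1) as yr
from whotracksme
order by yr;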

Next we want to summarize our data again, because that’s where it gets interesting. We want to see how the number of trackers on sites changes over time. There are a few things we need to do to get there. First, by simply taking the first 4 characters of the month column, we get the year the data was collected, which we can then store as a number. Secondly, we want one row per site per year, but we also want some more information on which trackers are actually used. Therefore, we don’t just count, we also aggregate the tracker names into a list with array_agg. This allows us to keep far more information per row than would be possible in Excel, without having to do weird string concatenations and splits.

from whotracksme
select 
  site, 
  cast(month[1:4] as INT) as yr,
  country, 
  round(avg(site_rank)) as avg_site_rank, 
  count(distinct tracker) as num_trackers,
  array_agg(distinct tracker) as trackers,
group by all order by site, yr desc

This gives us a nicely formatted table with sites, years and trackers.

site yr country avg_site_rank num_trackers trackers
000webhost.com 2024 global 7.0 12 [hotjar, …]
0123movie.art 2025 global 13.0 24 [google_fonts, …]
0123movie.net 2025 global 9.0 17 [cloudflare, …]
0123movies.com 2022 global 9.0 18 [jw_player, …]
[…]

The impossible (in Excel): JSON data as tables

You might not be familiar with JSON data, but it is everywhere. Basically every site uses it in some form or another to represent the data you see on the page. If you want to see what it looks like: Reddit, for example, allows you to turn any page into JSON. So if you go to reddit.com/r/duckdb.json you’ll find the posts in machine-readable format.

Reading this into Excel is pretty much impossible, but DuckDB does not mind at all. The following reads the newest 100 posts into a table.

create or replace table reddit_duckdb as (
  select * from read_json_auto('https://www.reddit.com/r/duckdb/new.json?limit=100')
)

If you want to fetch more posts, you can have a look at the Reddit API documentation to see how you can use the after and before parameters in the URL to scroll through the pages of comments.
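
A rough sketch of how paging could look, assuming the listing exposes its cursor as data.after (check the API docs for the exact shape); the after value in the URL below is just a placeholder:

-- the id to continue from is part of the listing we already downloaded
select data.after from reddit_duckdb;

-- plug that id into the next request (t3_xxxxxx is a placeholder)
create or replace table reddit_duckdb_next as (
  select * from read_json_auto('https://www.reddit.com/r/duckdb/new.json?limit=100&after=t3_xxxxxx')
);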

Since it is one single object with multiple levels of nesting, we have to unnest it fully to get to the 50 or so columns per post, which we can then transform and select as we like.

with unnested as (
  select unnest(data.children, recursive := true) from reddit_duckdb
)

select 
  title, 
  strftime(to_timestamp(created_utc), '%Y-%m-%d') created_date, 
  ups, 
  downs, 
  upvote_ratio, 
  num_comments, 
  domain, 
  url, 
  author, 
from unnested
order by created_utc desc

This, again, gives us a neatly formatted table to explore:

title created_date ups downs upvote_ratio num_comments domain author
Solving the Character Encoding Issue When Reading DuckDB via ODBC in Excel VBA 2025-10-30 3 0 0.81 0 self.DuckDB redraiment
Notepad++ 2025-10-25 1 0 0.67 10 self.DuckDB DESERTWATTS
The story behind how DNB moved off Databricks 2025-10-24 15 0 0.9 0 marimo.io akshayka
Valentina Studio & Valentina DuckDB Server 16.1 Supports DuckDB 1.4.1 2025-10-24 7 0 0.89 0 i.redd.it lynnfredricks
rusty-sheet: A DuckDB Extension for Reading Excel, WPS, and OpenDocument Files 2025-10-23 33 0 0.98 2 self.DuckDB redraiment

And that, folks, is how you leave Excel behind and never look back.