<aside> 💡
TLDR
This works great if all of your JSON files have the same schema, or if all of them can be loaded into a table at once using the `read_json("**/*.json", union_by_name = true)` function.
</aside>
To make this problem concrete, I've downloaded a database of all CVEs from ‣ using `git clone`. I want to create an API that can query this repository. I can think of the following available options:
1. A `grep`-based tool, e.g. `git grep`, `rg`, plain `grep`, etc.
2. SQL, either using a third-party tool or a custom solution.

I'd have preferred 1 if I were writing a CLI application. Since I am building a RESTful API, I'd prefer 2: queries will be easier to write and to integrate into other applications. A custom solution is also not a bad idea if existing tooling is not good enough or there are other constraints. Most databases, like PostgreSQL and sqlite3, allow JSON to be inserted and queried, as in the sketch below.
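This sketch uses SQLite's JSON1 functions; the `cve_records` table, its `body` column, and the JSON paths are my own illustrative choices, not anything taken from this repository.

-- hypothetical table holding one raw CVE JSON document per row in a TEXT column
SELECT json_extract(body, '$.cveMetadata.cveId') AS cve_id
FROM cve_records
WHERE json_extract(body, '$.cveMetadata.state') = 'PUBLISHED';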
While searching for such plugins, I came across https://duckdb.org/, which seems to support this use case. See for example https://duckdb.org/docs/data/multiple_files/overview 💯.
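The gist of that page, as I read it, is that DuckDB's table functions accept either a glob or an explicit list of paths. A rough sketch (the file paths here are purely illustrative):

-- read every JSON file under a directory tree via a glob
SELECT * FROM read_json('cves/2000/**/*.json');

-- or list the files explicitly
SELECT * FROM read_json(['cves/2000/0xxx/CVE-2000-0001.json', 'cves/2000/0xxx/CVE-2000-0002.json']);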
For this exercise, I am not interested in performance. I can use caching to improve performance drastically later if the need arises. I am looking for a solution that has the best DX — and, if I am lucky, one that doesn't easily allow stupid mistakes!
The installation was a breeze. Single binary! It can also be used as a Rust/Python library, and a third-party integration for PHP is also available 💯.
$ wget https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip
$ unzip duckdb_cli-linux-amd64.zip
Archive: duckdb_cli-linux-amd64.zip
inflating: duckdb
(PY311) [dilawar@rasmalai cve_database_git (master)]$ ./duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D
Great, seems to work!
For a sanity check, I did a “loopback” — print what you read. This was just to make sure that the engine is parsing the JSON files. I passed a glob that matches all JSON files. Loading all of them together raised an error: a mismatch in schema.
D SELECT * FROM 'cves/**/*.json';
Invalid Input Error: JSON transform error in file "cves/2001/1xxx/CVE-2001-1517.json", in record/value 1: Object {"affected":[{"product":"n/a","vendor":"n/a","vers... has unknown key "tags"
Try increasing 'sample_size', reducing 'maximum_depth', specifying 'columns', 'format' or 'records' manually, setting 'ignore_errors' to true, or setting 'union_by_name' to true when reading multiple files with a different structure.
D
A fair enough warning about inconsistent schemas across files, and it suggested what I should explore. I like tools that hint at what to do in case of an error 💯.
I am not sure which one is the best option here though: `ignore_errors`, or perhaps `union_by_name` is a better idea 🤔? As far as I can tell, `ignore_errors` would simply skip the records that don't fit the detected schema, while `union_by_name` merges the schemas of all the files, so I am going ahead with `union_by_name`. I had to tweak the query a little bit: `SELECT * FROM read_json('cves/2000/**/*.json', union_by_name = true);`
Nice!
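To poke at the result a bit more, two quick follow-ups come to mind (my own sketches, not part of the original session): count the records, and ask DuckDB to describe the schema it inferred after unifying the files.

-- how many CVE records were read from the 2000 directory?
SELECT count(*) FROM read_json('cves/2000/**/*.json', union_by_name = true);

-- which columns and types did DuckDB infer across all files?
DESCRIBE SELECT * FROM read_json('cves/2000/**/*.json', union_by_name = true);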
Note that we are not yet inserting these JSON records into a SQL table, but rather processing them on the fly. I say this because I don't see the size of the opened database `test.duck.db` increase at all!
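If I ever do want the records to live inside `test.duck.db`, materializing the same scan into a table should be enough. A minimal sketch, assuming I call the table `cves`:

-- persist the on-the-fly JSON scan as a regular table in the opened database
CREATE TABLE cves AS
    SELECT * FROM read_json('cves/**/*.json', union_by_name = true);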