See downloaded paper at Bespoke OLAP.pdf. The code is available on GitHub.
This paper describes generating a bespoke OLAP database engine that performs optimally for a given workload and dataset. While it has long been known that generality in DBMS implementations costs performance, bespoke implementations are rare because building a DBMS engine is expensive. That changes with LLM agents.
Simply prompting an LLM with all the inputs at once doesn’t work well because the components are complex and interdependent. Instead, a pipeline is used:
- First, given only the parameterized queries and the dataset, the agent decides on a storage layout.
- After finalizing that, it writes C++ structs for that layout and the ingestion code.
- It then produces a basic implementation, ensuring correctness by validating query results against a reference engine such as DuckDB.
- After that, each query template is optimized. Every template is handled by its own agent (i.e., its own chat context), and the agents apply patches to a shared global codebase.
- After each update, the harness runs a benchmark and reverts the change if there’s a regression. The result is fed back to the agent, which can then try a different direction.
One important piece of infrastructure provided here is a running database system that can be hot-patched. This shortens the feedback loop: the agent can patch the running system directly and immediately measure the performance effect. Data doesn’t need to be re-ingested either, as long as the storage layer is unchanged.
How to handle workload change?
Two separate mechanisms are provided, depending on the situation:
- If an ad-hoc query isn’t supported by the bespoke database, a generic SQL processor runs it over the same storage layer.
- The paper claims such queries run at roughly DuckDB-level efficiency, since the storage layer preserves the relational semantics of the data.
- If the workload drifts, i.e. new queries arrive or existing ones change, the database implementation should be resynthesized. This is feasible because synthesis takes hours rather than days and costs on the order of tens to a hundred dollars.