In the excitement around LLM codegen for Python, JavaScript, and other general-purpose languages, SQL can seem overlooked.
Yet SQL arguably underpins the global economy. So it’s worth asking: how good are LLMs at writing SQL?
We’re not the first to ask that question. Both academics and vendors have already published benchmarks measuring LLM performance on SQL generation.
In this post, we’ll review some of those benchmarks, explore what they reveal, and consider what makes for a meaningful SQL codegen benchmark.
What makes a good benchmark for LLM SQL codegen?
Most of us have felt that mix of surprise — and maybe a little awe — when an LLM produces a working block of code on the first try. But just as often, it stumbles: calling a library that doesn’t exist, forgetting a variable, or missing an edge case spelled out in the prompt.
With SQL, the tight, iterative nature of development makes it even harder for LLMs to get it right. Usually you write a query, execute it, inspect the output, and then refine based on what you see. That’s because the data itself is as much a part of the process as the SQL.
That puts specific demands on any benchmark that wants to judge an LLM’s ability to generate SQL. We’d argue that a good SQL benchmark needs to:
- Cover multiple databases: Postgres, MySQL, SQL Server, ClickHouse, and others all have quirks, so a fair benchmark has to span more than one engine.
- Include a variety of tasks: Queries range from simple SELECT … WHERE clauses to complex joins, window functions, and schema-aware challenges. A benchmark should reflect that spectrum.
- Check results, not just syntax: The real question isn’t whether the SQL executes but whether it returns the right data.
- Account for iteration: A useful benchmark has to capture the back-and-forth of refining queries, not just the first attempt.
- Be transparent: Without open schemas, datasets, and evaluation code, results are harder to rely on.
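To make these criteria concrete, here’s a rough sketch of what a single test case in such a benchmark might look like. Everything here is hypothetical, not taken from any of the benchmarks below:

```python
# Hypothetical shape for one benchmark case; field names and values are illustrative only.
benchmark_case = {
    "engine": "postgres",            # multi-database coverage: also mysql, sqlserver, clickhouse, ...
    "schema": "ecommerce",           # which fixture database to load before running the task
    "question": "Total spend per customer in 2024?",
    "reference_sql": (
        "SELECT customer_id, SUM(total) AS spend "
        "FROM orders WHERE ordered_at >= '2024-01-01' "
        "GROUP BY customer_id;"
    ),
    "check": "result_set_equality",  # judge the rows returned, not the SQL text
    "max_attempts": 3,               # leave room for iterative refinement, not just one shot
    "public": True,                  # schema, data, and scoring code published for transparency
}
```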
So, are there any AI SQL codegen benchmarks that live up to these criteria?
SQL benchmarks
Of course, SQL benchmarks predate LLMs. Perhaps the best known is Spider, released in 2018 as a dataset for the text-to-SQL research community. Although it was designed for semantic parsing rather than LLMs, it has since become something of a baseline in LLM evaluations. Researchers and vendors have also introduced newer benchmarks, which we look at below.
Spider
What it is: Released in 2018, Spider is a large-scale dataset of 200+ databases and thousands of natural-language-to-SQL tasks, originally built for the text-to-SQL research community.
Strengths: Wide schema coverage; reasonably complex, cross-domain queries; long-standing baseline for academic and LLM evaluations (including Codex, GPT-3.5, GPT-4).
Limitations: Built on clean SQLite schemas far removed from messy production databases; focuses on one-shot evaluation; doesn’t reflect iterative query refinement.
Verdict: A useful baseline and still the most widely reported benchmark, but arguably too academic to capture real-world SQL workflows.
BIRD-SQL
What it is: Introduced in 2023 as a follow-on to Spider, BIRD-SQL was designed specifically to challenge large models with more realistic BI-style workloads.
Strengths: Larger, messier schemas; queries that require schema reasoning and, in some cases, external knowledge; more demanding than Spider.
Limitations: Still academic in design; focuses on single-turn evaluation; less accessible to practitioners outside research.
Verdict: A step closer to reflecting real-world data challenges but still not iterative or multi-engine.
Tinybird SQL Benchmark
What it is: An industry-led benchmark by Tinybird, evaluating 19 LLMs on analytical SQL tasks using their own platform (built on ClickHouse), with comparisons to human baselines.
Strengths: Clear presentation; large dataset (hundreds of millions of rows); side-by-side results across multiple models.
Limitations: Tied to Tinybird’s own infrastructure; limited to one-shot query generation; results may not generalise beyond their environment.
Verdict: Polished and practical, but narrow in scope and not representative of how LLMs perform across different databases or in iterative workflows.
SQL-Eval
What it is: SQL-Eval is an open-source evaluation framework for assessing the correctness of SQL generated by LLMs. Each task comes with a reference SQL query that has been checked by humans. SQL-Eval runs both the model’s query and the reference query, then compares the results (we sketch the idea below).
Strengths: Fully open-source and reproducible; handles SQL complexity (nested queries, multiple joins), semantic variety, and cases where more than one correct query exists.
Limitations: Focuses on whether the query returns the same results as the reference, not on query quality or efficiency; supports only one-shot generation and doesn’t simulate iterative refinement.
Verdict: A valuable resource for judging how well models return the right results, and excellent on transparency, but it still doesn’t reflect the iterative workflows of real-world SQL use.
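To illustrate that idea, here’s a minimal sketch of execution-based comparison. It’s not SQL-Eval’s actual code; the schema and queries are made up, and SQLite stands in for whichever engine a real harness would use:

```python
import sqlite3
from collections import Counter

# Toy fixture; a real harness would load a much larger, human-curated schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO orders VALUES (1, 120.0), (1, 80.0), (2, 50.0);
""")

# Human-checked reference query (JOIN-based).
reference_sql = """
    SELECT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name HAVING SUM(o.total) > 100;
"""

# A model might write it differently (subquery-based) and still be correct.
model_sql = """
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders
                 GROUP BY customer_id HAVING SUM(total) > 100);
"""

def rows(sql: str) -> Counter:
    # Compare the returned rows as a multiset: row order doesn't matter,
    # but duplicates and values must match.
    return Counter(conn.execute(sql).fetchall())

print("equivalent:", rows(model_sql) == rows(reference_sql))  # True
```

A real harness also needs rules for things this sketch glosses over, such as column order, aliases, and extra columns in the result.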
Other benchmarks worth looking at
Newer projects look to plug some of the gaps left by Spider and BIRD-SQL.
- BIRD-CRITIC: A newer benchmark built on top of BIRD-SQL. Instead of just asking a model to write a query once, it checks whether the model can fix its own mistakes: if the first attempt returns an error or the wrong result, the benchmark asks the model to correct it and try again (a loop like the sketch after this list). It also does this across different database systems (Postgres, MySQL, SQL Server, Oracle).
- BEAVER: A benchmark built from real queries taken from enterprise data warehouses, not from purpose-built academic datasets. It includes the big, messy schemas you’d expect in finance, retail, or operations.
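To make the BIRD-CRITIC idea concrete, here’s a minimal sketch of a fix-and-retry loop. It isn’t BIRD-CRITIC’s actual harness; ask_llm is a hypothetical stand-in for a real model call, hard-coded here so the example runs end to end:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, signed_up TEXT)")

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model API call.
    # The first "attempt" references a column that doesn't exist; the "fix" corrects it.
    return ("SELECT id FROM users WHERE signed_up >= '2024-01-01'"
            if "no such column" in prompt
            else "SELECT id FROM users WHERE created_at >= '2024-01-01'")

def generate_with_retries(question: str, max_attempts: int = 3):
    prompt = question
    for attempt in range(max_attempts):
        sql = ask_llm(prompt)
        try:
            return sql, conn.execute(sql).fetchall()  # success: return the query and its rows
        except sqlite3.Error as err:
            # Feed the database error back to the model and try again,
            # which is the kind of self-correction BIRD-CRITIC tests.
            prompt = f"{question}\nYour last query failed with: {err}\nPlease fix it."
    raise RuntimeError("no working query after retries")

sql, rows = generate_with_retries("Which users signed up this year?")
print(sql, rows)
```

The signal a benchmark like this captures is whether the model actually uses the error feedback to change its answer, rather than repeating the same mistake.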
Pulling it all together
Each benchmark highlights a different part of the SQL challenge. Spider and BIRD-SQL standardised large-scale text-to-SQL evaluation, but on datasets far cleaner than real production systems. Tinybird shows how an industry-led approach can make benchmarks more accessible and comparative, while SQL-Eval brings transparency and a focus on whether queries return the right results.
Together, they suggest the next opportunity: benchmarks that move beyond static, one-shot queries to capture the messy back-and-forth of SQL development.
SQL AI queries with Beekeeper Studio
Benchmarks are useful for comparing models in controlled settings, but the real test is how well AI actually fits into your day-to-day workflow.
That’s where Beekeeper Studio comes in: it’s a SQL GUI that connects to your favorite LLM, so you can explore and query your data through a conversational interface or by writing SQL yourself. It’s fast, open source, and cross-platform, with support for Postgres, MySQL, SQLite, MongoDB, and more.