What text-to-SQL actually means
I spent ten years writing SQL that business users were waiting on. Same queries, different filters, different date ranges. Text-to-SQL felt like the obvious fix. Then I built one and learned why most of them fall apart in production.
The idea is simple: you type a question in English — "what were our top 10 products by revenue last quarter?" — and get back a working SQL query. Simple idea. Hard to get right.
The term covers everything from basic prompt wrappers around ChatGPT to full pipeline systems with schema profiling, contract enforcement, and multi-phase validation. Some tools produce SQL that looks right. Fewer produce SQL that is right. Even fewer produce SQL that's safe to run against your production database.
We built Chion because we kept hitting the same wall: the SQL looked correct, but it wasn't. A column name was off. A join was backwards. An aggregation was misleading. Nobody caught it until the chart was in the slide deck.
A brief history
Converting natural language to SQL isn't a new idea. Researchers started working on it in the 1970s and '80s — systems like LUNAR and CHAT-80 could handle questions against small, fixed databases. Cool demos. They didn't scale.
The modern wave kicked off around 2017 when deep learning models got good enough to map English to SQL syntax. The Spider benchmark (2018) gave the field a standardized test: 10,000+ questions across 200 databases. Everyone started competing on accuracy numbers.
Then large language models changed the game. GPT-3, Claude, Gemini — these models could generate SQL from plain English with zero training on your specific database. Just describe what you want and out comes SQL. Accuracy went up. But the failures got sneakier.
How basic approaches work
The simplest text-to-SQL setup is a prompt wrapper. You take a user's question, stuff it into a template with some schema info, and send it to an LLM. Something like:
"Given the following tables: [schema]. Write a SQL query to answer: [user question]."
The model returns SQL. You run it. Sometimes it works.
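The whole approach fits in a few lines. This is a minimal sketch of the template step only — the schema string and function name are illustrative, and the call to an actual model is left out:

```python
# Minimal prompt-wrapper sketch: stuff schema text and the user's
# question into a fixed template, then hand the result to an LLM.
# The schema string below is a toy example, not a real database.

def build_prompt(schema_description: str, question: str) -> str:
    """Build the single prompt a wrapper-style tool sends to the model."""
    return (
        f"Given the following tables: {schema_description}. "
        f"Write a SQL query to answer: {question}."
    )

prompt = build_prompt(
    "orders(id, product_id, amount, created_at), products(id, name)",
    "what were our top 10 products by revenue last quarter?",
)
# The model's reply is run as-is -- with no guarantee the columns or
# joins it references actually exist.
```

Everything downstream of that string is up to the model, which is exactly where the trouble starts.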
This approach has real appeal: it's fast to build, it works for simple queries, and it handles a wide range of question phrasings. If your database has five tables and your users ask simple questions, a prompt wrapper might be enough.
But most real databases aren't five tables. And most real questions aren't simple.
Where basic approaches fail
Here's what happens when a prompt-only tool hits a real database: the model sees your column is called 'rev_q_adj' but has no idea that means 'revenue adjusted for quarterly returns.' So it guesses. Sometimes it guesses right. Enough times that you ship it. Then it guesses wrong on the one query that goes into a board presentation.
We've tracked four failure patterns that keep showing up: hallucinated columns (the model references a column that doesn't exist, or exists in a different table), invalid joins (a join path that doesn't follow any real foreign key, sometimes producing a silent cartesian product), misleading aggregations (a GROUP BY that drops or double-counts rows while still returning a plausible-looking number), and unsafe statements (a query that writes or alters data when it should only read).
Prompt-based vs pipeline architectures
The text-to-SQL world has split into two camps. The distinction matters if you're picking a tool.
Prompt-based tools treat SQL generation as a language problem. They optimize the prompt: better schema descriptions, few-shot examples, chain-of-thought reasoning. The model does all the work. You improve by writing better prompts or using better models.
Pipeline architectures treat SQL generation as an engineering problem. They break it into stages — schema profiling, entity resolution, strategy selection, generation, validation, repair, execution — with checks at each stage. The model handles generation. Engineering handles correctness.
We're in the pipeline camp. Not because prompting doesn't work — it does, for simple stuff. But prompting alone can't guarantee the things that matter in production: column existence, valid joins, read-only enforcement, row limits.
Our 12-phase pipeline: here's what happens
When you ask Chion a question, it passes through 12 phases. This isn't complexity for fun — each phase exists because we found a type of error that simpler systems missed.
Phase 1: Schema profiling. We scan your database and map every table, column, data type, foreign key, value distribution, and sample value. This runs once at connection and updates when your schema changes.
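On PostgreSQL, the raw material for this kind of profile lives in the standard information_schema views. These two queries are a sketch of what a profiling phase might run — they're not Chion's actual queries, and they skip the value-distribution and sampling steps:

```python
# Sketch: pull tables/columns and foreign-key edges from PostgreSQL's
# standard information_schema views. Real profiling would also sample
# values and compute distributions.

PROFILE_COLUMNS_SQL = """
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
"""

PROFILE_FOREIGN_KEYS_SQL = """
SELECT tc.table_name, kcu.column_name,
       ccu.table_name AS foreign_table, ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"""
```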
Phase 2: Entity resolution. When you say "customers," which table? When you say "last quarter," what date range? We map natural language to specific schema objects using the profile from Phase 1.
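The date-range half of that mapping is plain calendar arithmetic. Here's a minimal sketch of resolving "last quarter" to a concrete half-open date range — the function name and shape are illustrative:

```python
from datetime import date

def last_quarter_range(today: date) -> tuple[date, date]:
    """Resolve 'last quarter' to a concrete [start, end) date range."""
    q = (today.month - 1) // 3                      # current quarter, 0-based
    year, prev_q = (today.year, q - 1) if q > 0 else (today.year - 1, 3)
    start = date(year, prev_q * 3 + 1, 1)
    end = date(year + 1, 1, 1) if prev_q == 3 else date(year, prev_q * 3 + 4, 1)
    return start, end

# Asked in Q2 2024, "last quarter" means Q1 2024:
start, end = last_quarter_range(date(2024, 5, 15))
# start == date(2024, 1, 1), end == date(2024, 4, 1)
```

The table half of the mapping is harder, which is why it leans on the Phase 1 profile rather than the model's guess.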
Phase 3: Strategy selection. Not every question needs the same approach. A simple count uses different SQL patterns than a multi-table time-series comparison. We pick from 7 strategies based on question structure and the entities involved.
Phase 4: Contract generation. Before the LLM writes SQL, we create a contract — a specification of what the query is allowed to touch. It lists the permitted tables, columns, join paths, and aggregation patterns. Think of it as guardrails for SQL generation.
Phase 5: Constrained generation. This is where hallucination dies. The model writes SQL, but the contract already decided what it can touch. Wrong column? Can't reference it. Made-up join? Doesn't exist in the contract. The output is still creative — the model picks the best way to answer — but the boundaries are hard.
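In data-structure terms, a contract is a whitelist plus a membership check. This sketch shows the idea with a deliberately simplified contract — the real one also covers join paths and aggregation patterns, and the names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryContract:
    """Whitelist of what generated SQL may touch (simplified sketch)."""
    tables: frozenset[str]
    columns: frozenset[tuple[str, str]]     # (table, column) pairs

def check_references(contract: QueryContract, used_tables, used_columns):
    """Anything outside the contract is a hard error, not a guess."""
    errors = [f"table '{t}' not in contract"
              for t in used_tables - contract.tables]
    errors += [f"column '{t}.{c}' not in contract"
               for t, c in used_columns - contract.columns]
    return errors

contract = QueryContract(
    tables=frozenset({"customers"}),
    columns=frozenset({("customers", "total_revenue")}),
)
# A hallucinated column is rejected outright:
errs = check_references(contract, {"customers"},
                        {("customers", "customer_lifetime_value")})
```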
Phase 6: Syntax validation. We parse the SQL and confirm it's valid syntax. Catches broken queries before they hit your database.
Phase 7: Schema alignment. Every column reference gets checked against the real schema. Every table exists. Every column belongs to the table that references it.
Phase 8: Join validation. Multi-table queries get their join paths checked. Foreign key relationships must be real. Cartesian products get flagged.
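A join-path check reduces to edge membership in the foreign-key graph built during profiling. This is a toy version with a hypothetical two-edge schema:

```python
# Join-path check sketch: a join is valid only if it follows a real
# foreign-key edge (in either direction). The edges below are a toy
# schema; in practice they come from schema profiling.

FK_EDGES = {
    ("orders.customer_id", "customers.id"),
    ("orders.product_id", "products.id"),
}

def valid_join(left: str, right: str) -> bool:
    return (left, right) in FK_EDGES or (right, left) in FK_EDGES

def check_joins(join_pairs):
    """Flag joins that follow no foreign key -- likely accidental
    cartesian products or made-up relationships."""
    return [pair for pair in join_pairs if not valid_join(*pair)]

bad = check_joins([
    ("orders.customer_id", "customers.id"),   # real FK: passes
    ("orders.id", "products.name"),           # no FK: flagged
])
```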
Phase 9: Aggregation check. GROUP BY clauses are checked for completeness. Window functions are validated. This catches the "misleading aggregation" problem.
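The completeness rule itself is simple: every selected column must be either aggregated or grouped. A sketch, assuming the column lists have already been extracted from the parsed query:

```python
def groupby_gaps(select_cols, aggregated_cols, group_by):
    """Bare selected columns must appear in GROUP BY; a column that is
    neither aggregated nor grouped yields a misleading result."""
    bare = set(select_cols) - set(aggregated_cols)
    return sorted(bare - set(group_by))

# SELECT region, rep, SUM(amount) ... GROUP BY region
gaps = groupby_gaps(
    select_cols=["region", "rep", "amount"],
    aggregated_cols=["amount"],      # wrapped in SUM()
    group_by=["region"],
)
# 'rep' is neither aggregated nor grouped: flagged
```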
Phase 10: Safety enforcement. Read-only check. No INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE. Row limits enforced (≤1,000 rows, ≤12,000 cells). Last line of defense.
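The shape of that last line of defense can be sketched in a few lines. This toy version uses keyword matching for clarity; a production check would work on the parsed AST, not on strings:

```python
import re

FORBIDDEN = ("INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE")
MAX_ROWS = 1_000

def enforce_read_only(sql: str) -> str:
    """Reject write statements; append a row limit if none is present.
    Keyword matching is a simplification -- real enforcement should
    inspect the parsed statement, not raw text."""
    for kw in FORBIDDEN:
        if re.search(rf"\b{kw}\b", sql, re.IGNORECASE):
            raise ValueError(f"forbidden statement: {kw}")
    if not re.search(r"\bLIMIT\s+\d+\b", sql, re.IGNORECASE):
        sql = f"{sql.rstrip().rstrip(';')} LIMIT {MAX_ROWS};"
    return sql
```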
Phase 11: Automatic repair. If any phase fails, the pipeline tries to fix it. Wrong column name? Fuzzy-matched to the closest real one. Bad join path? Re-routed through valid foreign keys. Most errors get fixed without starting over.
Phase 12: Execution and visualization. Verified SQL runs read-only against your database. Results show up as interactive D3.js charts. The SQL is always visible so you can see exactly what ran.
Contract-based generation: what makes us different
We kept watching the same failure. Model writes SELECT customer_lifetime_value FROM customers — except the column is called total_revenue and it's in a different table. Better prompt? Same failure, just less often. So we stopped asking the model to be careful and started telling it what it's allowed to touch. The contract is a whitelist. Everything outside it is a hard error, not a guess.
The contract defines — before the model writes anything — which columns it can use, which aggregations are valid, and what the grain of the query has to be. The model can't step outside that boundary. That's what makes the output checkable.
This works like type systems in programming. A dynamic language lets you reference any variable, and errors show up when you run the code. A typed language catches bad references before you run anything. Our contract layer is type-checking for SQL generation.
When to use text-to-SQL
Text-to-SQL isn't the right tool for every analytics job. Here's our honest take:
Good fit: Ad-hoc questions where you need a quick answer. Exploration where you don't know what you're looking for. Teams where data access is stuck behind a few analysts. Cases where "good enough, fast" beats "perfect, next sprint."
Less ideal fit: Regulated reporting with exact formatting rules. Executive dashboards that need pixel-perfect layouts and scheduled refreshes. Queries that run daily — just write the SQL once and schedule it.
Worst fit: Write operations. ETL pipelines. Database admin work. If you need to change data, use a migration tool. Text-to-SQL is for reading.
Getting started
If you want to try verified text-to-SQL against your PostgreSQL database, you can connect to Chion in under two minutes. We support PostgreSQL today. BigQuery, Snowflake, and MySQL are next on the roadmap.
The free tier gives you enough credits to see if the pipeline handles your schema and question patterns. No credit card required.