The query looked right. It wasn't.
Here's the scenario that breaks most text-to-SQL tools — and we know because we've watched it happen. A user connects their production database during onboarding. They ask for revenue by region for Q3. The query comes back fast. The numbers look reasonable. They compare it to their finance report and find a $200K gap. The AI had joined two tables on a column that existed in both — but meant different things in each. The query ran fine. The answer was wrong.
This is SQL hallucination. If you're using an AI tool to write SQL against your production database, it's the problem you need to understand.
What SQL hallucination actually is
In the AI world, "hallucination" means a model produces something that sounds right but isn't. In SQL, hallucination takes specific, predictable forms. We've tracked four types from building Chion and from watching users hit the same problems with other tools.
The phantom column
The model references a column that doesn't actually exist in your database — or one that got renamed six months ago. Maybe it exists in the model's training data from a different database. Maybe the model guessed what the column "should" be called based on the question.
Example: You ask about "customer lifetime value." The model writes SELECT customer_lifetime_value FROM customers — but your schema stores CLV as total_revenue in a separate customer_metrics table.
Sometimes this throws a hard error (column not found). Sometimes — and this is worse — the model picks a column with a similar name that measures something different.
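The worse case is catchable before execution if you check every identifier against a schema map. Here's a minimal sketch of that check — the schema map and the regex-based tokenizer are deliberately naive placeholders (a real checker would use a SQL parser and resolve aliases), and the table and column names are invented for the example:

```python
import re

# Hypothetical schema map: table -> set of real column names.
SCHEMA = {
    "customers": {"id", "name", "region"},
    "customer_metrics": {"customer_id", "total_revenue"},
}

def phantom_columns(sql: str) -> set[str]:
    """Return identifiers in the SQL that match no known table or column.

    Naive sketch: a production checker parses the SQL instead of
    tokenizing with a regex, and resolves aliases and qualified names.
    """
    known = {col for cols in SCHEMA.values() for col in cols}
    known |= set(SCHEMA)  # bare table names are fine too
    keywords = {"select", "from", "where", "join", "on", "as", "and", "or"}
    tokens = set(re.findall(r"[a-z_]+", sql.lower()))
    return tokens - known - keywords

print(phantom_columns("SELECT customer_lifetime_value FROM customers"))
# The made-up column is flagged: {'customer_lifetime_value'}
```

Anything the function returns is a hard failure before the query ever runs — which converts the silent "similar name, different meaning" case into a visible one.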
The wrong join
Two tables share a column name — say, 'customer_id' — but they track different things. The model joins on it anyway. The results look right. They're not. Common patterns:
- Cartesian products: The model joins two tables without a proper ON clause, producing every possible row combination. The query runs. The numbers are wildly inflated. If you don't catch it, you report 100x the actual revenue.
- Backwards joins: The model uses the right tables but joins through the wrong keys. It connects orders to products through a column that shares a name but isn't a real foreign key.
- Missing middle tables: Many relationships need a junction table. "Customers to products" goes through orders. The model might join customers to products directly, skipping orders.
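The cartesian-product case is easy to demonstrate end to end. This toy in-memory example (table names and data invented for illustration) shows how dropping the ON clause silently inflates a revenue total:

```python
import sqlite3

# Tiny in-memory demo of a cartesian product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# Correct join: each order is counted once.
good = conn.execute(
    "SELECT SUM(amount) FROM orders "
    "JOIN customers ON orders.customer_id = customers.id"
).fetchone()[0]

# Missing ON clause: every order pairs with every customer.
bad = conn.execute("SELECT SUM(amount) FROM orders, customers").fetchone()[0]

print(good, bad)  # 225.0 vs 450.0 — revenue doubled by the cartesian product
```

Here the inflation factor is just 2x because there are two customers; on production tables it scales with row counts, which is how "100x the actual revenue" happens.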
The misleading aggregation
This is the sneakiest type. The SQL is correct. The join is right. But the aggregation answers a different question than what was asked.
"What's our average order value?" could mean:
- Average across all orders (simple AVG)
- Average per customer, then average of those averages (weighted differently)
- Average per month, showing a trend
- Average minus refunds and cancellations
Each version produces valid SQL and a reasonable-looking number. But only one answers what the user meant. Without context about your business logic, the model picks whatever pattern it knows best from training.
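Two of those readings diverge sharply on skewed data. This runnable sketch (invented data, chosen so the gap is obvious) contrasts the per-order average with the average of per-customer averages:

```python
import sqlite3

# Two readings of "average order value", on made-up skewed data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 10.0), (1, 10.0), (2, 100.0);
""")

# Reading 1: average across all orders.
per_order = conn.execute("SELECT AVG(amount) FROM orders").fetchone()[0]

# Reading 2: average per customer, then the average of those averages.
per_customer = conn.execute("""
    SELECT AVG(cust_avg)
    FROM (SELECT AVG(amount) AS cust_avg FROM orders GROUP BY customer_id)
""").fetchone()[0]

print(per_order, per_customer)  # 32.5 vs 55.0 — both valid SQL, different questions
```

Both queries are syntactically correct and both return a plausible number; only business context tells you which one the user meant.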
The unsafe mutation
A language model writing SQL has no concept of "read-only." Without hard constraints, it can write:
- DELETE FROM users WHERE last_login < '2025-01-01' (when asked "which users haven't logged in this year?")
- UPDATE orders SET status = 'cancelled' (when asked "show me cancelled orders")
- DROP TABLE temp_reports (when asked to "clean up the reports")
In production, any of these could be a disaster. The model doesn't know the difference between "show me" and "do it."
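The simplest hard constraint is a gate that refuses anything that isn't a single SELECT. Here's a sketch — deliberately crude (a real guard uses a SQL parser, since legitimate CTEs start with WITH, and pairs this with read-only database credentials as a second layer); the filename in the comment is hypothetical:

```python
import sqlite3

def reject_mutations(sql: str) -> str:
    """Hard gate: pass only a single SELECT statement.

    Sketch only — a production guard parses the statement (CTEs start
    with WITH, for instance) rather than matching a prefix.
    """
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:
        raise ValueError("multiple statements rejected")
    if not stripped.lower().startswith("select"):
        raise ValueError("only SELECT is allowed")
    return stripped

# Belt and braces: SQLite can also open the file itself read-only,
# so even a statement that slips past the gate cannot write.
# conn = sqlite3.connect("file:prod.db?mode=ro", uri=True)
```

The point is that the constraint lives outside the model: "show me cancelled orders" can't become an UPDATE no matter what the model emits.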
Why prompt engineering can't fix this
We tried. Aggressive prompting brought our error rate from ~30% to ~12%. Then we hit a wall. The remaining errors weren't language problems — the model didn't have the information it needed. No prompt was going to teach it that rev_q_adj means quarterly adjusted revenue. That's when we stopped trying to fix the model and started building a fence around it.
Prompt engineering is advice. It tells the model what it should do. It can't enforce what it must do. A prompt that says "only reference columns in the schema" is a suggestion. The model can still make up a column name — it just does it less often.
Contract-based generation: the structural fix
Before we send a single token to the language model, we build a fence: here are the columns you can use, here's the allowed grain, here are the permitted filters. The model writes SQL inside that fence. If it tries to reference something outside it, that's a hard failure — not a silent wrong answer. We don't patch bad SQL. We reject it and try again with better context.
Here's how it works in practice:
Step 1: Profile the schema. Before any SQL gets written, we scan your database and build a full structural map. Every table, column, data type, foreign key relationship, value distribution, and naming pattern. This map is the source of truth.
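For a sense of what that map contains, here's a minimal profiling sketch for SQLite (other databases expose the same facts through information_schema; this is an illustration of the idea, not our production profiler, and it skips value distributions and naming patterns):

```python
import sqlite3

def profile_schema(conn: sqlite3.Connection) -> dict:
    """Build a structural map: tables, columns with declared types,
    and foreign key relationships. SQLite-flavoured sketch."""
    profile = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        fks = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
        profile[table] = {
            # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
            "columns": {c[1]: c[2] for c in cols},
            # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
            "foreign_keys": [(fk[3], fk[2], fk[4]) for fk in fks],
        }
    return profile
```

Everything downstream — the contract, the join-path check, the column validation — reads from this map instead of trusting the model's memory.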
Step 2: Build a contract. When a user asks a question, we figure out which tables and columns are relevant and build a contract — a spec of what the SQL can reference. The contract includes:
- Permitted tables
- Permitted columns (mapped to their real names and types)
- Valid join paths (verified against foreign keys)
- Allowed aggregation patterns
- Read-only enforcement (SELECT only)
- Row and cell limits
Step 3: Constrained generation. The LLM writes SQL inside the contract bounds. It can only touch tables and columns in the contract. It can only use join paths that are real. It can only produce SELECT statements.
Step 4: Post-generation validation. Even with the contract, we check the output. Every column reference is verified against the schema. Every join is checked. Every aggregation is reviewed. If something slips through, the repair cycle catches it.
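Steps 2 and 4 together can be sketched in a few lines. The contract below is a hypothetical example (plain data: permitted columns, verified join paths, read-only flag), and the validator is a deliberately simple regex pass — a real validator walks the parsed AST rather than pattern-matching text:

```python
import re

# Hypothetical contract from step 2: plain data, nothing clever.
CONTRACT = {
    "columns": {"orders.amount", "orders.customer_id",
                "customers.id", "customers.region"},
    "join_paths": {("orders.customer_id", "customers.id")},
    "read_only": True,
}

def validate(sql: str, contract: dict) -> list[str]:
    """Post-generation checks against the contract. Regex sketch —
    a production validator inspects the parsed query instead."""
    errors = []
    lowered = sql.lower()
    if contract["read_only"] and not lowered.lstrip().startswith("select"):
        errors.append("not a SELECT")
    # Every qualified column reference must be inside the contract.
    for ref in re.findall(r"\b([a-z_]+\.[a-z_]+)\b", lowered):
        if ref not in contract["columns"]:
            errors.append(f"column outside contract: {ref}")
    # Every join condition must be a verified path (either direction).
    for a, b in re.findall(r"\bon\s+([a-z_.]+)\s*=\s*([a-z_.]+)", lowered):
        if (a, b) not in contract["join_paths"] and \
           (b, a) not in contract["join_paths"]:
            errors.append(f"join path not in contract: {a} = {b}")
    return errors
```

An empty list means the SQL stays inside the fence; anything else is a hard failure that triggers the repair cycle instead of reaching the database.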
What this kills
Let's map the contract back to our four hallucination types:
Phantom columns: Gone. The contract lists every permitted column by exact name. The model can't reference anything outside the list.
Wrong joins: Gone. The contract specifies valid join paths from actual foreign key relationships. The model can't invent a path — it uses one from the contract or nothing.
Misleading aggregations: Reduced by a lot. The contract includes aggregation guidance based on question structure and data types. It can't remove all ambiguity (some questions are genuinely ambiguous), but it narrows the possibilities to sensible ones.
Unsafe mutations: Gone. SELECT-only is enforced at three layers: the contract, post-generation validation, and the connection itself (read-only credentials). Even if one layer fails, the others hold.
The practical result
With contract-based generation and the full pipeline, our effective error rate is well below what raw benchmarks would predict. Most errors get caught and fixed before the SQL touches your database.
When something does get through, you always see the SQL. No black box. You can check what ran, flag the issue, and we use that to tighten the contract layer.
Some questions are genuinely ambiguous — no system, human or AI, can resolve them without a follow-up. But the gap between "usually right" and "structurally checked" matters when business decisions ride on the output.