The problem with naive aggregation
Most SQL generators pick aggregation functions based on the question text alone. "Average salary" → AVG. "Total revenue" → SUM. But what happens when the model applies SUM to a categorical column, or AVG to an identifier?
Column type classification
During schema profiling, every column gets classified:
- Quantitative: numeric columns where arithmetic makes sense (revenue, salary, count)
- Temporal: date/timestamp columns that drive GROUP BY grain
- Categorical: string/enum columns with bounded cardinality (region, status, plan_name)
- Identifier: primary keys, foreign keys, unique identifiers
Aggregation rules
The SQL contract enforces aggregation rules based on column type:
- Quantitative → SUM, AVG, MIN, MAX, PERCENTILE_CONT
- Categorical → COUNT, COUNT(DISTINCT)
- Temporal → GROUP BY grain (day/week/month/quarter/year)
- Identifier → never aggregated, used only in JOIN conditions and WHERE filters
The budget algorithm
Chion's aggregation engine uses a bi-dimensional (grain, K) search algorithm to stay within the row budget (≤1,000 rows / 12,000 cells). If the natural grain produces too many rows, the pipeline coarsens the time grain before applying TopK — never auto-filtering dates.
Why this matters
Invalid aggregations produce numbers that look right but are meaningless. SUMming a categorical column returns a number — but it's nonsense. The contract prevents this at the structural level, not the prompt level.