Back to Resources

How Chion's Aggregation Rules Engine Works

Deep dive into Chion's aggregation rules: column type classification drives SUM vs AVG vs COUNT selection. How the pipeline prevents invalid aggregations.

By Jonathan Dag··8 min read

The problem with naive aggregation

Most SQL generators pick aggregation functions based on the question text alone. "Average salary" → AVG. "Total revenue" → SUM. But what happens when the model applies SUM to a categorical column, or AVG to an identifier?

Column type classification

During schema profiling, every column gets classified:

  • Quantitative: numeric columns where arithmetic makes sense (revenue, salary, count)
  • Temporal: date/timestamp columns that drive GROUP BY grain
  • Categorical: string/enum columns with bounded cardinality (region, status, plan_name)
  • Identifier: primary keys, foreign keys, unique identifiers

Aggregation rules

The SQL contract enforces aggregation rules based on column type:

  • Quantitative → SUM, AVG, MIN, MAX, PERCENTILE_CONT
  • Categorical → COUNT, COUNT(DISTINCT)
  • Temporal → GROUP BY grain (day/week/month/quarter/year)
  • Identifier → never aggregated, used only in JOIN conditions and WHERE filters

The budget algorithm

Chion's aggregation engine uses a bi-dimensional (grain, K) search algorithm to stay within the row budget (≤1,000 rows / 12,000 cells). If the natural grain produces too many rows, the pipeline coarsens the time grain before applying TopK — never auto-filtering dates.

Why this matters

Invalid aggregations produce numbers that look right but are meaningless. SUMming a categorical column returns a number — but it's nonsense. The contract prevents this at the structural level, not the prompt level.