How Chion Works: The 13-Phase Verified SQL Pipeline

This is the full pipeline — every step, every decision, every place where we chose to do something differently. We wrote it because we believe you should be able to verify everything before trusting it.

Pipeline Architecture

A multi-phase AI pipeline — not a thin wrapper around an LLM.

Setup

One-time configuration by the user.

00

Connect with read-only credentials. That’s the first constraint we enforce — not the last. We scan your schema, not your data. We never see a row until you ask a question that requires one. PostgreSQL is live today. BigQuery, Snowflake, and MySQL are on the roadmap.

Row-Level Security (RLS) is honored end-to-end. Chion connects with read-only credentials stored in an encrypted vault. No data leaves your database — only metadata and query results pass through the pipeline.

Automatic Profiling

Runs once per data source — no user involvement after setup.

01

Connects to your database and scans every table, column, and constraint. Builds a complete structural map before any AI touches the data.

All scans respect your RLS policies. Foreign keys, indexes, and constraint metadata are captured to inform join paths downstream.

02

Measures cardinality, null rates, min/max ranges, and value distributions for every column. Classifies each as temporal, metric, categorical, or identifier.
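The classification step can be sketched with simple rules over the statistical profile. This is an illustrative assumption, not Chion's actual classifier — the type names follow the text, but the thresholds and field names are hypothetical:

```typescript
type ColumnClass = "temporal" | "metric" | "categorical" | "identifier";

// Hypothetical profile shape; field names are illustrative assumptions.
interface ColumnProfile {
  name: string;
  sqlType: string;       // e.g. "timestamp", "numeric", "text"
  distinctRatio: number; // cardinality / row count
  nullRate: number;
}

function classifyColumn(p: ColumnProfile): ColumnClass {
  // date/time types are temporal regardless of cardinality
  if (/date|time/i.test(p.sqlType)) return "temporal";
  // near-unique columns behave like identifiers, whatever their type
  if (p.distinctRatio > 0.95) return "identifier";
  // remaining numeric columns are metrics; everything else is categorical
  if (/int|numeric|float|double|decimal/i.test(p.sqlType)) return "metric";
  return "categorical";
}
```

A rule-based pass like this is cheap to run per column; the LLM labeling in the next step builds on top of it.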

03

An LLM reads each column in context and assigns human-readable labels — "Revenue by Quarter," not revenue_q. Embeddings are generated for RAG retrieval.

Currently powered by Anthropic Claude. The architecture is model-agnostic — OpenAI, Google, and on-premise options are on the roadmap. The model receives column names, types, and statistical profiles. No raw row data is sent.

04

Top values per column are sampled, ranked by frequency, and embedded into a vector index. This powers fuzzy entity matching at query time.

Embeddings are generated via the same LLM provider used in Step 3. Vectors are stored in Supabase pgvector for sub-50ms retrieval.

Query Time

Triggered each time the user asks a question.

05

You type a plain-English question — "Revenue by region last quarter" or "Top 10 customers by churn risk." This is the second and final user interaction.

06

Your natural-language question is parsed through 7 discovery strategies — entity lookup, comparison, top-K ranking, time-bounded analysis, and more.

The LLM classifies query type, extracts hard constraints, and identifies entities. All strategies run in parallel for speed.
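A rough sketch of how questions could be routed to strategies. The real pipeline uses an LLM classifier; the keyword rules and the `Strategy` names below are illustrative assumptions based on the strategies named in the text:

```typescript
type Strategy = "entityLookup" | "comparison" | "topK" | "timeBounded";

// Hypothetical keyword-based router; the production classifier is an LLM.
function matchStrategies(question: string): Strategy[] {
  const q = question.toLowerCase();
  const matched: Strategy[] = [];
  if (/\btop\s+\d+\b/.test(q)) matched.push("topK");
  if (/\b(vs|versus|compared?)\b/.test(q)) matched.push("comparison");
  if (/\b(last|this|past)\s+(day|week|month|quarter|year)\b/.test(q)) matched.push("timeBounded");
  if (matched.length === 0) matched.push("entityLookup"); // default strategy
  return matched;
}
```

Because each strategy is independent, they can all be dispatched concurrently and the results merged — which is what makes the parallel execution mentioned above straightforward.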

07

User terms are resolved to physical columns via vector search. An adaptive scorer fills a 1,200-token context window — no hard thresholds, just natural score breaks.

Vector search uses the embeddings built in Steps 3–4. The adaptive scorer uses distribution analysis and normalization — not a fixed cutoff.
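The "natural score break" idea can be illustrated as follows: rank candidates, cut at the largest gap between adjacent scores, then fill the token budget in order. The 1,200-token budget comes from the text; the gap heuristic and field names are illustrative assumptions:

```typescript
// Hypothetical candidate shape; field names are illustrative assumptions.
interface Candidate { column: string; score: number; tokens: number; }

function selectCandidates(cands: Candidate[], budget = 1200): Candidate[] {
  const ranked = [...cands].sort((a, b) => b.score - a.score);
  // find the largest drop between adjacent scores: the "natural break"
  let cut = ranked.length;
  let maxGap = 0;
  for (let i = 1; i < ranked.length; i++) {
    const gap = ranked[i - 1].score - ranked[i].score;
    if (gap > maxGap) { maxGap = gap; cut = i; }
  }
  // fill the token budget with everything above the break
  const selected: Candidate[] = [];
  let used = 0;
  for (const c of ranked.slice(0, cut)) {
    if (used + c.tokens > budget) break; // respect the context window
    selected.push(c);
    used += c.tokens;
  }
  return selected;
}
```

The point of cutting at the break rather than at a fixed score is that the cutoff adapts per query: a question with two obviously relevant columns keeps two, not five.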

08

The system evaluates the query structure, result shape, and number of series to select the best chart type. Compatibility scoring ranks 8 chart types — single line, multi-series line, dual-axis line, stacked area, standard bar, grouped bar, stacked bar, and animated bar race.

Each chart type has a compatibility function that scores the match (≥0.7 strong, ≥0.4 acceptable). The fallback chain is line → bar → scatter → table. Selection is deterministic — same data shape always produces the same chart type. Dual-axis and multi-series layouts are assigned automatically based on series count and scale divergence.
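Deterministic selection via compatibility functions can be sketched like this. The 0.4 acceptance threshold comes from the text; the three scorers shown (of the eight chart types) and their exact scoring rules are illustrative assumptions:

```typescript
// Hypothetical result-shape summary; fields are illustrative assumptions.
interface ResultShape { seriesCount: number; xIsTemporal: boolean; rowCount: number; }

// Each chart type gets a compatibility function scoring the match.
const scorers: Record<string, (s: ResultShape) => number> = {
  line:       s => (s.xIsTemporal ? 0.8 : 0.2) + (s.seriesCount === 1 ? 0.1 : 0),
  groupedBar: s => (!s.xIsTemporal && s.seriesCount > 1 ? 0.7 : 0.3),
  bar:        s => (!s.xIsTemporal && s.seriesCount === 1 ? 0.8 : 0.2),
};

function selectChart(shape: ResultShape): string {
  let best = "table";   // final fallback when nothing scores >= 0.4
  let bestScore = 0.4;  // acceptance threshold from the pipeline
  for (const [type, score] of Object.entries(scorers)) {
    const v = score(shape);
    // strict > keeps ties deterministic: first-registered type wins
    if (v > bestScore) { best = type; bestScore = v; }
  }
  return best;
}
```

Because the scorers are pure functions of the result shape, the same shape always yields the same chart — no model call, no randomness.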

09

A structural boundary is generated — allowed columns, filters, aggregations, grain, and ordering. The SQL generator cannot introduce out-of-contract elements.

The contract enforces RLS-aware column access. Out-of-contract references are a hard failure, not a silent omission. This contract layer is what makes the pipeline model-agnostic — the LLM generates SQL against the contract, not against your data.
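A minimal sketch of the hard-failure check, assuming a contract shape like the one described above (the field names are illustrative, not Chion's actual schema):

```typescript
// Hypothetical contract shape; field names are illustrative assumptions.
interface QueryContract {
  allowedColumns: string[];
  maxRows: number;
}

// Any column referenced by generated SQL must appear in the contract.
// An out-of-contract reference throws — a hard failure, never a silent omission.
function checkContract(referencedColumns: string[], contract: QueryContract): void {
  for (const col of referencedColumns) {
    if (!contract.allowedColumns.includes(col)) {
      throw new Error(`column "${col}" is outside the query contract`);
    }
  }
}
```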

10

Read-only SQL is generated against the contract, lint-checked for safety, and verified. A 3-layer defense enforces row budgets, blocks SELECT *, and caps output at 1,000 rows / 12,000 cells.

L1: pre-validation (200-bucket budget). L2: runtime lint (block SELECT *, enforce LIMIT). L3: post-execution truncation with disclosure. All queries are SELECT-only — enforced by assertReadOnlySelect. The same verification stack runs regardless of which model generated the SQL. Provider choice affects latency and cost, not security posture.
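The L2 runtime lint can be illustrated with a simplified check. The real implementation (`assertReadOnlySelect` and friends) presumably parses SQL properly; this regex version is a sketch of the policy, not the production code:

```typescript
const MAX_ROWS = 1000; // row cap from the pipeline's L3 budget

// Simplified illustration of the L2 lint: SELECT-only, no SELECT *,
// and a LIMIT that is present and capped.
function lintSql(sql: string): string {
  const trimmed = sql.trim().replace(/;\s*$/, "");
  if (!/^select\b/i.test(trimmed)) throw new Error("only SELECT statements are allowed");
  if (/select\s+\*/i.test(trimmed)) throw new Error("SELECT * is blocked");
  const limit = trimmed.match(/\blimit\s+(\d+)/i);
  if (!limit) return `${trimmed} LIMIT ${MAX_ROWS}`; // enforce a LIMIT
  if (Number(limit[1]) > MAX_ROWS) {
    return trimmed.replace(/\blimit\s+\d+/i, `LIMIT ${MAX_ROWS}`); // cap it
  }
  return trimmed;
}
```

Linting the generated text is only one layer; the pre-validation budget (L1) and post-execution truncation (L3) bracket it on either side.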

Results

Delivered back to the user.

11

SQL runs against your live database with RLS enforced. If it fails, up to 3 repair cycles rewrite the query with error context. Still failing? A hard failure with actionable suggestions.

Repairs are deterministic rewrites using the error message and contract — not random retries. Credentials are vault-resolved per execution.
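The bounded repair loop looks roughly like this. The `execute` and `repair` callbacks are illustrative assumptions (and shown synchronous for brevity); the 3-cycle cap follows the text:

```typescript
// Sketch of a bounded repair loop: each failure feeds its error message
// into a deterministic rewrite, and after maxRepairs cycles the last
// error surfaces as a hard failure.
function executeWithRepair<T>(
  sql: string,
  execute: (sql: string) => T,
  repair: (sql: string, error: string) => string,
  maxRepairs = 3,
): T {
  let current = sql;
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRepairs; attempt++) {
    try {
      return execute(current);
    } catch (err) {
      lastError = err;
      if (attempt < maxRepairs) {
        current = repair(current, String(err)); // rewrite with error context
      }
    }
  }
  throw lastError; // hard failure after exhausting repair cycles
}
```

Feeding the concrete database error into the rewrite is what makes repairs converge: a "column does not exist" message pinpoints the fix, where a blind retry would just fail again.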

12

Results render as interactive D3.js charts with zoom, pan, and dual-axis support. A grounded narrative references actual data — no hallucinated summaries.

The narrative is generated by an LLM but grounded in the actual query results. Every claim traces back to a row. Chart type was selected in Step 8 — this step renders it with full interactivity.
