Update #2

The hidden cost of "ask anything" analytics

We built an analytics agent that converts natural language to SQL. It works remarkably well, most of the time. Then we discovered a fundamental problem.

Max Chehab
6 min read

The variability problem

When a user asks "show me page views," our agent can generate valid SQL in many different ways. Each interpretation is defensible. The problem isn't that any version is wrong; it's that the interpretation varies between invocations.

Same question, three different answers

Monday, 9am: "Show me page views"
SELECT COUNT(*) FROM events WHERE type = 'pageview'
Result: 124,892

Monday, 2pm: "Show me page views"
SELECT SUM(page_views) FROM analytics WHERE date = CURRENT_DATE
Result: 89,441

Tuesday, 10am: "Show me page views"
SELECT COUNT(DISTINCT session_id) FROM ga4_events WHERE event = 'page_view'
Result: 67,203

For a data engineer, this is annoying. For a marketing manager who can't read SQL? It's a serious problem.

Why this matters for non-technical users

The core issue is an asymmetry in the ability to evaluate output quality. Technical users can verify the SQL. Non-technical users see a chart and a number, and the AI's confidence masks the underlying uncertainty.

Semantic drift (risk: HIGH)
Same question → different SQL over time → incomparable trends

Table selection ambiguity (risk: MEDIUM)
Multiple valid tables (raw events vs. facts) → inconsistent choice

Aggregation logic variance (risk: MEDIUM)
COUNT vs. SUM vs. COUNT DISTINCT → fundamentally different questions

Verified metrics

Our solution is to constrain what the agent can do. Instead of generating SQL from scratch, it selects from a library of pre-defined, human-verified metrics. The critical property is determinism: given the same metric and parameters, you get identical SQL every time.

Before: Raw AI SQL
  • Different SQL every time
  • AI guesses which tables to use
  • Calculation logic varies
  • No way to audit what happened
After: Verified Metrics
  • Same SQL every time, guaranteed
  • Human-verified table selection
  • Documented calculation logic
  • Full explanation of assumptions
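
To make the determinism concrete, here's a minimal sketch of what a verified metric could look like in Python. The VerifiedMetric class, the occurred_at column, and the exact template are illustrative, not our production code; the point is that the SQL text is fixed and the only degrees of freedom are explicit, documented parameters.

from dataclasses import dataclass

@dataclass(frozen=True)
class VerifiedMetric:
    """A human-verified metric: a fixed SQL template plus allowed parameters."""
    name: str
    description: str
    sql_template: str
    allowed_params: tuple = ()

    def render(self, **params) -> str:
        # Reject anything outside the documented parameter list.
        unknown = set(params) - set(self.allowed_params)
        if unknown:
            raise ValueError(f"unsupported parameters: {sorted(unknown)}")
        # str.format keeps the query shape fixed; real code would bind the
        # values safely instead of interpolating them into the string.
        return self.sql_template.format(**params)

# Hypothetical metric; occurred_at and the date-range parameters are assumptions.
PAGE_VIEWS = VerifiedMetric(
    name="page_views",
    description="Count of pageview events in a date range.",
    sql_template=(
        "SELECT COUNT(*) AS page_views "
        "FROM events "
        "WHERE type = 'pageview' "
        "AND occurred_at BETWEEN '{start_date}' AND '{end_date}'"
    ),
    allowed_params=("start_date", "end_date"),
)

# Same metric, same parameters -> byte-identical SQL, every time.
print(PAGE_VIEWS.render(start_date="2024-01-01", end_date="2024-01-31"))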

Semantic retrieval with LLM validation

Finding the right metric happens in two stages.

Stage 1: Embedding similarity

We generate embeddings for user queries and metric intents, then use pgvector to find candidates. "Top pages", "most viewed content", and "popular articles" all match the same metric.
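
For a rough idea of this stage, here's a sketch of the candidate search, assuming a metric_intents table with a pgvector embedding column. The table name, columns, and query shape are illustrative, not our actual schema.

# Stage 1 sketch: nearest-neighbour search over metric intent embeddings.
# The metric_intents table and its columns are assumptions for illustration.
import psycopg

CANDIDATE_SQL = """
    SELECT metric_name,
           description,
           1 - (embedding <=> %(query_vec)s::vector) AS similarity
    FROM metric_intents
    ORDER BY embedding <=> %(query_vec)s::vector
    LIMIT %(k)s
"""

def find_candidates(conn: psycopg.Connection, query_embedding: list[float], k: int = 5) -> list[dict]:
    """Return the k metric intents closest to the user's query embedding."""
    # pgvector accepts a '[x1,x2,...]' text literal cast to vector.
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(CANDIDATE_SQL, {"query_vec": vec_literal, "k": k})
        return [
            {"name": name, "description": desc, "similarity": sim}
            for name, desc, sim in cur.fetchall()
        ]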

Stage 2: LLM validation

Embedding similarity alone produces false positives. We pass candidates through an LLM that evaluates whether each metric can actually answer the query. About 15% of candidates get filtered out at this stage, and each one is a wrong answer a user never sees.
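
Here's stage 2 in sketch form: each candidate is shown to the model alongside the user's question and kept only if the model says the metric can answer it. The prompt wording is illustrative, and call_llm stands in for whichever model API is in use.

from typing import Callable

# Stage 2 sketch: ask the model whether each candidate can answer the question.
VALIDATION_PROMPT = """\
User question: {question}

Candidate metric: {metric_name}
Metric description: {metric_description}

Can this metric, as described, answer the user's question? Reply with exactly YES or NO."""

def validate_candidates(
    question: str,
    candidates: list[dict],
    call_llm: Callable[[str], str],  # assumed wrapper around the model API
) -> list[dict]:
    """Keep only the candidates the LLM judges able to answer the question."""
    accepted = []
    for metric in candidates:
        prompt = VALIDATION_PROMPT.format(
            question=question,
            metric_name=metric["name"],
            metric_description=metric["description"],
        )
        if call_llm(prompt).strip().upper().startswith("YES"):
            accepted.append(metric)
    return accepted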

Metric detection

We can't pre-define metrics for every possible question. The system needs to handle novel queries and learn what metrics are missing.

INPUT: user query ("show me churn")
→ SEARCH: metric_finder() scores the best match at 0.62, below the 0.75 threshold
→ FALLBACK: the agent writes raw SQL (SELECT ...)
→ CAPTURE: the query is recorded as a pending opportunity
→ OUTPUT: a new verified metric, churn_rate

When no established metric matches, our agent falls back to writing raw SQL so the user still gets an answer. We capture each of those instances as an opportunity, turning common patterns into new metrics and reviewing edge cases for reliability. The system learns from real usage without sacrificing consistency.
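
A sketch of the fallback decision and opportunity capture might look like this. The 0.75 threshold mirrors the example above, and the metric_opportunities table and its columns are assumptions.

# Fallback sketch: threshold check plus opportunity logging.
SIMILARITY_THRESHOLD = 0.75  # assumed, taken from the example above

OPPORTUNITY_SQL = """
    INSERT INTO metric_opportunities (question, generated_sql, status)
    VALUES (%(question)s, %(generated_sql)s, 'pending')
"""

def should_fall_back(validated: list[dict]) -> bool:
    """True when no validated metric clears the similarity threshold."""
    return not any(m["similarity"] >= SIMILARITY_THRESHOLD for m in validated)

def record_opportunity(conn, question: str, generated_sql: str) -> None:
    """Log a question the metric library couldn't answer, for later review."""
    with conn.cursor() as cur:
        cur.execute(OPPORTUNITY_SQL, {"question": question, "generated_sql": generated_sql})
    conn.commit()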

What we've learned

Validation is essential

We're still refining our validation techniques, but the LLM stage already catches a meaningful share of false positives (roughly 15% of candidates) before they ever reach users.

Coverage grows fast

After a few weeks of detection, we captured the vast majority of common queries. The long tail exists, but it's smaller than we assumed.

Users notice inconsistency

Before this, we'd get tickets like "why did my conversion rate change 20% overnight?" The answer was usually different SQL. Those tickets are gone.

Know your users

This approach makes sense for non-technical users. For data engineers exploring unfamiliar schemas, raw generation might be right. Context matters.

What's next

We're still iterating. Can we use opportunity data to draft new metrics automatically? Can we build metrics that combine data across integrations? Can we surface metric documentation better so users understand exactly what they're looking at?

The core question, whether AI analytics tools should generate SQL freely or select from verified queries, doesn't have a universal answer. For our users, reliable, human-validated answers matter more than unconstrained flexibility, and constraining the agent lets AI be applied exactly where it helps.

Ready for analytics you can trust?

Experience the difference verified metrics make. Connect your data sources and get consistent, reliable insights.