A Real-World Case Study in AI Data Quality
Based on the CA-AIDev Bot Quality Audit Project
For students new to AI development
Instead of relying only on what an AI model "memorized" during training, Retrieval-Augmented Generation (RAG) lets the AI look up information from a knowledge base before answering.
- Without RAG: the AI relies on its training data, which can be outdated or wrong
- With RAG: the AI searches your knowledge base for current, accurate information
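To make the retrieval step concrete, here is a minimal sketch of the flow in Python. The table name (`chunks`), database name, and model choices are illustrative assumptions, not the project's actual code:

```python
# Minimal RAG retrieval sketch (table/database/model names are illustrative)
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str) -> str:
    # 1. Embed the question
    emb = client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding
    vec = "[" + ",".join(map(str, emb)) + "]"
    # 2. Retrieve the closest chunks from the knowledge base (cosine distance)
    conn = psycopg2.connect("dbname=bots")
    cur = conn.cursor()
    cur.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
        (vec,),
    )
    context = "\n\n".join(row[0] for row in cur.fetchall())
    # 3. Answer from the retrieved context instead of memory alone
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Answer from this context:\n{context}\n\nQ: {question}"}],
    )
    return chat.choices[0].message.content
```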
The project covered three bots:

- BizBot: California business licensing information
- KiddoBot: child care and family resources
- WaterBot: water board regulations
Core Value: Every piece of information a bot returns must be accurate and verifiable.
We developed a systematic approach to ensure data quality:
| Phase | Focus | Question Answered |
|---|---|---|
| 1 | URL Validation | Are all links still working? |
| 2 | Content Quality | Is the information still accurate? |
| 3 | Deduplication | Any duplicate or corrupt data? |
| 4 | Chunk Consistency | Are chunks the right size? |
| 5 | Query Testing | Does search return relevant results? |
Government websites change constantly. A link that worked last year might be broken today.
California's Franchise Tax Board restructured their entire website:
OLD: ftb.ca.gov/forms/search/index.aspx?form=XXX
NEW: ftb.ca.gov/forms/search/
OLD: ftb.ca.gov/pay/business/web-pay.html
NEW: ftb.ca.gov/pay/bank-account/index.asp
We found 116 broken URLs across all three bots (15% of total)!
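A link checker for this kind of audit can be quite small. The sketch below is illustrative (the HEAD-then-GET fallback and timeouts are assumptions, not the project's actual scanner):

```python
# Minimal broken-link scanner (illustrative sketch)
import requests

def check_urls(urls: list[str]) -> list[tuple[str, str]]:
    broken = []
    for url in urls:
        try:
            # HEAD is cheap; some servers reject it, so fall back to GET
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                resp = requests.get(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                broken.append((url, f"HTTP {resp.status_code}"))
        except requests.RequestException as exc:
            broken.append((url, type(exc).__name__))
    return broken
```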
We built automated scanners to flag potentially outdated content:
Finding: Manufacturing guide referenced ISO 13485:2003
Problem: The 2003 version was withdrawn in 2016!
Fix: Updated to ISO 13485:2016
Without this audit, users would have been pointed to an obsolete standard.
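A scanner like this can be as simple as a set of regular expressions over chunk text. A minimal sketch (the two patterns shown are examples, not the project's full rule set):

```python
# Flag chunks that cite possibly superseded standards or stale years (illustrative)
import re

STALE_PATTERNS = [
    (re.compile(r"ISO\s*13485:2003"), "ISO 13485:2003 was withdrawn; current is 2016"),
    (re.compile(r"\b(19\d{2}|200\d)\b"), "pre-2010 year; verify content is still current"),
]

def scan_chunk(text: str) -> list[str]:
    """Return a note for every stale-content pattern found in this chunk."""
    return [note for pattern, note in STALE_PATTERNS if pattern.search(text)]
```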
| Check | BizBot | KiddoBot | WaterBot |
|---|---|---|---|
| Duplicate content | 0 ✅ | 0 ✅ | 0 ✅ |
| NULL embeddings | 0 ✅ | 0 ✅ | 0 ✅ |
| Wrong dimensions | 0 ✅ | 0 ✅ | 0 ✅ |
Expected dimensions: 1536 (OpenAI's text-embedding-ada-002 model)
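These checks translate directly into SQL. A minimal sketch, assuming a hypothetical `chunks` table with an `embedding` pgvector column (names are illustrative):

```python
# Embedding integrity checks against pgvector (illustrative table/column names)
import psycopg2

conn = psycopg2.connect("dbname=bots")
cur = conn.cursor()

# Rows whose embedding was never written
cur.execute("SELECT count(*) FROM chunks WHERE embedding IS NULL")
null_count = cur.fetchone()[0]

# Rows whose embedding has the wrong dimensionality for ada-002
# (only possible if the column isn't declared as vector(1536))
cur.execute("SELECT count(*) FROM chunks WHERE vector_dims(embedding) <> 1536")
bad_dims = cur.fetchone()[0]

print(f"NULL embeddings: {null_count}, wrong dimensions: {bad_dims}")
```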
| Bot | Total Chunks | Undersized | Ideal Range % |
|---|---|---|---|
| BizBot | 425 | 33 (7.8%) | 92.2% |
| KiddoBot | 1,390 | 14 (1%) | 98.9% |
| WaterBot | 1,253 | 1 (0.08%) | 99.9% ⭐ |
WaterBot = Gold Standard! 99.9% ideal chunk sizes, 100% metadata coverage. This became our reference for best practices.
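Measuring chunk-size distribution is a single query. A sketch of the kind of check behind the table above, assuming the same illustrative `chunks` table and the 100-character minimum recommended later in this write-up:

```python
# Chunk-size distribution check (illustrative names and threshold)
import psycopg2

MIN_CHUNK_SIZE = 100  # characters; per the recommendation in this audit

conn = psycopg2.connect("dbname=bots")
cur = conn.cursor()
cur.execute(
    """
    SELECT count(*) FILTER (WHERE length(content) < %s) AS undersized,
           count(*) AS total
    FROM chunks
    """,
    (MIN_CHUNK_SIZE,),
)
undersized, total = cur.fetchone()
print(f"{undersized}/{total} chunks under {MIN_CHUNK_SIZE} chars")
```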
Those 33 undersized BizBot chunks? They became the star of Phase 5...
Do real user queries actually return relevant results? But there's a catch...
BizBot "exceeded" its 500+ chunk target... but 33% of content was duplicate. We hit the target by counting the same content multiple times!
| Bot | Reported | Actually Unique | Duplicates |
|---|---|---|---|
| BizBot | 637 | 425 | 212 (33%) |
| KiddoBot | 1,482 | 1,402 | 80 (5%) |
| WaterBot | 1,489 | 1,401 | 88 (6%) |
Root cause: Arbitrary targets ("500+ chunks") incentivized quantity over quality. The ingestion scripts had no deduplication check.
"FTB business registration requirements"
"Is hard water bad for my health?"
The fix keeps one row per unique content (the lowest id) and deletes the rest:

```sql
-- Keep the lowest id from each duplicate group, delete the rest
-- (table name "chunks" is illustrative)
DELETE FROM chunks
WHERE id NOT IN (
    SELECT MIN(id)
    FROM chunks
    GROUP BY md5(content)
);
```
Result: 380 duplicate rows removed
Created 25 consumer FAQ docs for WaterBot:
New documents existed in the database but returned 0 results in similarity search. The culprit: pgvector's IVFFlat index builds its cluster centroids from the rows present when the index is created, so rows inserted in bulk afterwards can land in ill-fitting clusters and be missed at query time until the index is rebuilt.
```sql
REINDEX INDEX schema.embedding_idx;  -- REQUIRED after bulk inserts!
```
After fixes: All three bots at 100% adversarial query coverage.
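This phase is also the easiest to automate. A sketch of a retrieval smoke test, assuming the illustrative `chunks` table and the OpenAI Python client (the distance threshold of 0.25 is a made-up example, not the project's actual cutoff):

```python
# Retrieval smoke test: every test query should hit at least one close chunk
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
conn = psycopg2.connect("dbname=bots")
cur = conn.cursor()

TEST_QUERIES = [
    "FTB business registration requirements",
    "Is hard water bad for my health?",
]

for query in TEST_QUERIES:
    emb = client.embeddings.create(
        model="text-embedding-ada-002", input=query
    ).data[0].embedding
    vec = "[" + ",".join(map(str, emb)) + "]"
    # Cosine distance; smaller is closer
    cur.execute(
        "SELECT content, embedding <=> %s::vector AS dist "
        "FROM chunks ORDER BY dist LIMIT 3",
        (vec,),
    )
    rows = cur.fetchall()
    best = rows[0][1] if rows else float("inf")
    # 0.25 is an illustrative threshold; tune against labeled queries
    print(f"{'OK  ' if best < 0.25 else 'MISS'} {query!r} (best distance {best:.3f})")
```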
- Arbitrary chunk count targets (500+, 1,400+) incentivize gaming the metric. Measure query success rate, not row counts.
- Not all "old dates" are errors. Historical content (drought timelines, past regulations) is intentionally historical. Build smart filters.
- Users say "Stage 1 vs Stage 2," but content says "CalWORKs Child Care Stages." Consider synonyms and query suggestions.
- Unit tests on data aren't enough. You must test the actual retrieval flow with realistic user questions.
- Enforce a MIN_CHUNK_SIZE at ingestion (100+ chars recommended).

The final production-readiness checklist:

| Requirement | BizBot | KiddoBot | WaterBot |
|---|---|---|---|
| URLs validated | ✅ | ✅ | ✅ |
| Content verified | ✅ | ✅ | ✅ |
| Zero duplicates | ✅ | ✅ | ✅ |
| Query coverage ≥90% | 100% ✅ | 100% ✅ | 100% ✅ |
✅ READY FOR PRODUCTION
Start with a small knowledge base (10-20 documents). Run through all 5 phases. You'll learn more from finding and fixing real issues than from reading about them!
Questions? 🙋‍♀️