Quality Assurance for RAG Knowledge Bases

A Real-World Case Study in AI Data Quality

Based on the CA-AIDev Bot Quality Audit Project
For students new to AI development

What is RAG?

RAG = Retrieval-Augmented Generation

Instead of relying only on what an AI model "memorized" during training, RAG lets the AI look up information from a knowledge base before answering.

Think of it like an open-book exam. The AI can search through your documents to find relevant information before crafting a response.
🧠

Without RAG

AI relies on training data (can be outdated or wrong)

📚

With RAG

AI searches your knowledge base for current, accurate info
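
A minimal sketch of that retrieve-then-generate loop. The helpers search_chunks (semantic search over the knowledge base) and llm (the language model call) are hypothetical stand-ins, not the actual bot code:

# Retrieve-then-generate in miniature; search_chunks and llm are hypothetical.
def answer(question, search_chunks, llm):
    # 1. Retrieve: pull the most relevant chunk texts from the knowledge base.
    chunks = search_chunks(question, top_k=5)
    # 2. Augment: put the retrieved text into the prompt as context.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using only the context below and cite the source URL.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3. Generate: the model writes the answer from the supplied context.
    return llm(prompt)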

Why Quality Matters

Garbage In = Garbage Out
If your knowledge base has bad data, your AI will give bad answers.

The Three Bots in Our Case Study

💼

BizBot

California business licensing information

👶

KiddoBot

Child care and family resources

💧

WaterBot

Water board regulations

Core Value: Every piece of information a bot returns must be accurate and verifiable.

The 5-Phase Quality Audit

We developed a systematic approach to ensure data quality:

Phase   Focus              Question Answered
1       URL Validation     Are all links still working?
2       Content Quality    Is the information still accurate?
3       Deduplication      Any duplicate or corrupt data?
4       Chunk Consistency  Are chunks the right size?
5       Query Testing      Does search return relevant results?
Like a restaurant health inspection: you check everything from ingredients (content) to storage (database) to serving (retrieval).

Phase 1: URL Validation

The Problem

Government websites change constantly. A link that worked last year might be broken today.

587
Broken URLs Found

Common Issues:

  • Sites restructured (.html → .asp)
  • Pages moved to new locations
  • Domains changed entirely
  • Bot protection (false positives)
๐Ÿ” Example: FTB URL Changes

California's Franchise Tax Board restructured their entire website:

OLD: ftb.ca.gov/forms/search/index.aspx?form=XXX
NEW: ftb.ca.gov/forms/search/

OLD: ftb.ca.gov/pay/business/web-pay.html
NEW: ftb.ca.gov/pay/bank-account/index.asp

We found 116 broken URLs across all three bots (15% of total)!
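
Phase 1 automation can be as simple as checking status codes. Here is a minimal sketch using the requests library; the real project also used Playwright, since some government sites block plain scripted requests and produce false positives:

# Minimal URL health check (illustrative).
import requests

def check_urls(urls, timeout=10):
    broken = []
    for url in urls:
        try:
            # HEAD is cheap; some servers reject it, so fall back to GET.
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            if resp.status_code >= 400:
                resp = requests.get(url, allow_redirects=True, timeout=timeout)
            if resp.status_code >= 400:
                broken.append((url, resp.status_code))
        except requests.RequestException as exc:
            broken.append((url, type(exc).__name__))
    return broken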

Phase 2: Content Quality Audit

Scanning for Outdated Information

We built automated scanners to flag potentially outdated content:

4,952
Total Findings
613
Critical Items
492
Actionable
๐Ÿ› Real Bug Found: ISO Standard Update

Finding: Manufacturing guide referenced ISO 13485:2003

Problem: The 2003 version was withdrawn in 2016!

Fix: Updated to ISO 13485:2016

Without this audit, users would have been pointed to an obsolete standard.

Key Insight: Not all "old dates" are problems. Historical drought documentation (2012-2016) is intentionally historical. Context matters!
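
A sketch of what such a scanner might look like, with a context filter so intentionally historical chunks are skipped. The patterns and the historical markers below are illustrative assumptions, not the project's actual rules:

# Illustrative outdated-content scanner with a context filter.
import re

STALE_PATTERNS = [
    r"ISO 13485:2003",       # withdrawn standard (the real bug found above)
    r"\b20(0\d|1[0-5])\b",   # years 2000-2015 deserve a human look
]
HISTORICAL_MARKERS = ["historical", "archived", "timeline"]  # illustrative

def scan_chunk(text):
    if any(marker in text.lower() for marker in HISTORICAL_MARKERS):
        return []  # intentionally historical content is not an error
    findings = []
    for pattern in STALE_PATTERNS:
        for match in re.finditer(pattern, text):
            findings.append((pattern, match.group()))
    return findings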

Phase 3: Deduplication & Embedding Integrity

What Are Embeddings?

Embeddings are like GPS coordinates for text. They turn words into numbers so the AI can measure how "close" two pieces of text are in meaning.
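
A toy example of measuring that closeness with cosine similarity. The three-number vectors below are made up for illustration; real text-embedding-ada-002 vectors have 1,536 numbers:

# Cosine similarity: near 1.0 means similar meaning, near 0 means unrelated.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

hard_water    = [0.9, 0.1, 0.2]  # toy vector for "hard water minerals"
water_quality = [0.8, 0.2, 0.1]  # toy vector for "drinking water quality"
tax_forms     = [0.1, 0.9, 0.3]  # toy vector for "FTB tax forms"

print(cosine_similarity(hard_water, water_quality))  # high: related topics
print(cosine_similarity(hard_water, tax_forms))      # low: unrelated topics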

What We Checked

Check              BizBot  KiddoBot  WaterBot
Duplicate content  0 ✓     0 ✓       0 ✓
NULL embeddings    0 ✓     0 ✓       0 ✓
Wrong dimensions   0 ✓     0 ✓       0 ✓
All checks passed! No remediation required for Phase 3.

Expected dimensions: 1536 (OpenAI's text-embedding-ada-002 model)
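
These checks boil down to three counts that should all be zero. A sketch run from Python with psycopg2; the documents table, embedding column, and connection string are placeholders for each bot's knowledge base:

# Phase 3-style integrity checks against a pgvector table (names are placeholders).
import psycopg2

CHECKS = {
    "duplicate content rows":
        "SELECT count(*) - count(DISTINCT md5(content)) FROM documents",
    "NULL embeddings":
        "SELECT count(*) FROM documents WHERE embedding IS NULL",
    "wrong dimensions":
        "SELECT count(*) FROM documents WHERE vector_dims(embedding) <> 1536",
}

with psycopg2.connect("dbname=ragdb") as conn, conn.cursor() as cur:
    for name, sql in CHECKS.items():
        cur.execute(sql)
        bad_rows = cur.fetchone()[0]
        print(f"{name}: {bad_rows} (expect 0)")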

Phase 4: Chunk & Metadata Consistency

What is Chunking?

Documents are too big to search all at once. We break them into "chunks", like cutting a pizza into slices. Each slice should be just the right size: big enough to be useful, small enough to be specific.

Chunk Size Analysis

Bot       Total Chunks  Undersized  Ideal Range %
BizBot    425           33 (7.8%)   92.2%
KiddoBot  1,390         14 (1%)     98.9%
WaterBot  1,253         1 (0.08%)   99.9% ⭐

WaterBot = Gold Standard! 99.9% ideal chunk sizes, 100% metadata coverage. This became our reference for best practices.

Those 33 undersized BizBot chunks? They became the star of Phase 5...
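
A sketch of the size audit itself. The 100-character floor matches the MIN_CHUNK_SIZE tip later in this deck; the 2,000-character ceiling is an illustrative assumption rather than the project's actual threshold:

# Illustrative chunk-size audit: flag stubs and oversized chunks.
MIN_CHUNK_CHARS = 100    # stubs below this carry too little context to retrieve well
MAX_CHUNK_CHARS = 2000   # illustrative ceiling; very long chunks dilute relevance

def audit_chunks(chunks):
    undersized = sum(1 for c in chunks if len(c) < MIN_CHUNK_CHARS)
    oversized = sum(1 for c in chunks if len(c) > MAX_CHUNK_CHARS)
    ideal = len(chunks) - undersized - oversized
    return {
        "total": len(chunks),
        "undersized": undersized,
        "oversized": oversized,
        "ideal_pct": round(100 * ideal / len(chunks), 1) if chunks else 0.0,
    }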

Phase 5: Query Coverage Testing

The Ultimate Test: Adversarial Queries

Do real user queries actually return relevant results? But there's a catch...

⚠️ The Circular Validation Trap: Testing with queries you wrote while creating content is meaningless; of course it answers questions it was designed to answer!

Non-Circular Methodology:

  1. Source queries from real user forums
  2. Use government FAQ pages (not your content)
  3. 75 adversarial queries total
  4. Test similarity threshold: ≥0.40

Real User Questions:

  • "Is hard water bad for my health?"
  • "Stage 1 vs Stage 2 childcare?"
  • "I got a violation noticeโ€”what now?"
  • "Handyman $500 exemption rules?"
Initial Results Were Shocking:
WaterBot coverage: 64% (target: 90%). It had regulatory content but zero consumer FAQ coverage!
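
A sketch of how a coverage run can be scored: embed each adversarial query, ask pgvector for the single closest chunk, and count the query as covered only if the similarity reaches 0.40. The documents table, embedding column, and embed() helper are placeholders; <=> is pgvector's cosine-distance operator, so similarity is 1 minus the distance:

# Illustrative coverage scorer for adversarial queries (names are placeholders).
import psycopg2

def coverage(queries, embed, dsn="dbname=ragdb", threshold=0.40):
    covered = 0
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for query in queries:
            # embed() stands in for the embedding call (e.g. ada-002).
            vec = str(list(embed(query)))  # pgvector accepts '[0.1, 0.2, ...]' text
            cur.execute(
                """SELECT 1 - (embedding <=> %s::vector) AS similarity
                   FROM documents
                   ORDER BY embedding <=> %s::vector
                   LIMIT 1""",
                (vec, vec),
            )
            row = cur.fetchone()
            if row is not None and row[0] >= threshold:
                covered += 1
    return covered / len(queries)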

🔬 The Big Discovery: 33% Duplicates

The Real Problem

BizBot "exceeded" its 500+ chunk target... but 33% of content was duplicate. We hit the target by counting the same content multiple times!

The Vanity Metrics Trap

Bot       Reported  Actually Unique  Duplicates
BizBot    637       425              212 (33%)
KiddoBot  1,482     1,402            80 (5%)
WaterBot  1,489     1,401            88 (6%)

Root cause: Arbitrary targets ("500+ chunks") incentivized quantity over quality. The ingestion scripts had no deduplication check.
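
The missing guard is small: hash each chunk's content before inserting and skip anything already seen. A sketch (the normalization and hash choice are illustrative):

# Illustrative ingestion-time dedup guard: keep row counts honest.
import hashlib

def dedupe_chunks(chunks, seen_hashes=None):
    seen = set(seen_hashes or [])
    unique = []
    for chunk in chunks:
        digest = hashlib.md5(chunk.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already ingested; don't inflate the count
        seen.add(digest)
        unique.append(chunk)
    return unique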

🧮 Why Circular Validation Gives False Confidence

Imagine testing a French-English dictionary by looking up words you picked FROM the dictionary. Of course you'll find them! Real users search for words they DON'T know are in there.

What We Did Wrong (Initially)

โŒ Circular Testing

"FTB business registration requirements"

  • Query designed around our content
  • Uses same terminology we wrote
  • 100% coverage (meaningless)

✅ Adversarial Testing

"Is hard water bad for my health?"

  • Real question from Reddit
  • Uses consumer vocabulary
  • 0% coverage (real problem!)
Key Insight: WaterBot had 1,400+ chunks of regulatory content but zero consumer FAQ content. Users don't ask about TMDLs; they ask about chlorine smell.

🔧 The Fixes: Three Critical Changes

1️⃣ Deduplication

-- Keep the first row per unique content hash (table name is illustrative).
DELETE FROM documents
WHERE id NOT IN (
  SELECT MIN(id)
  FROM documents
  GROUP BY md5(content)
);

Result: 380 duplicate rows removed

2️⃣ Content Gap Fill

Created 25 consumer FAQ docs for WaterBot:

  • Hard water, chlorine smell
  • How to read CCRs
  • Violation notice response

3️⃣ IVFFlat Index Rebuild (Critical!)

New documents existed in the database but returned 0 results in similarity search. pgvector's IVFFlat index chooses its list centroids from the data present when the index is built, so rows inserted later can land in lists the search never probes until the index is rebuilt!

REINDEX INDEX schema.embedding_idx;  -- REQUIRED after bulk inserts!

After fixes: All three bots at 100% adversarial query coverage.

📚 Key Learnings

1️⃣

Vanity Metrics Create Duplicates

Arbitrary chunk count targets (500+, 1400+) incentivize gaming the metric. Measure query success rate, not row counts.

2️⃣

Context Classification Matters

Not all "old dates" are errors. Historical content (drought timelines, past regulations) is intentionally historical. Build smart filters.

3️⃣

Vocabulary Mismatch is Real

Users say "Stage 1 vs Stage 2", but content says "CalWORKs Child Care Stages". Consider synonyms and query suggestions.

4️⃣

Test With Real Queries

Unit tests on data aren't enough. You must test the actual retrieval flow with realistic user questions.

๐Ÿ› ๏ธ Practical Tips for Your RAG Projects

✅ Before Loading Data
  • Set a MIN_CHUNK_SIZE (100+ chars recommended)
  • Validate URLs before embedding
  • Check for duplicate content hashes
  • Ensure consistent metadata schema
๐Ÿ” During Operation
  • Log queries with fewer than 5 results
  • Monitor for "zero result" queries
  • Track which content never gets retrieved
  • Set up periodic URL validation
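A sketch of that logging hook, wrapped around whatever retrieval function the bot already uses (the names are illustrative):

# Illustrative retrieval wrapper that surfaces content gaps during operation.
import logging

logger = logging.getLogger("rag.coverage")

def search_with_logging(query, search_chunks, min_results=5):
    results = search_chunks(query)
    if not results:
        logger.warning("zero results: %r", query)
    elif len(results) < min_results:
        logger.info("only %d results: %r", len(results), query)
    return results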
🧪 Quality Gates Checklist
  • โ˜ Zero duplicate rows
  • โ˜ Zero NULL embeddings
  • โ˜ All embeddings correct dimensions
  • โ˜ Zero stubs (<100 chars)
  • โ˜ 90%+ query coverage
  • โ˜ URLs return 2xx status codes

📊 Final Results Dashboard

Query Coverage by Bot (After Remediation)

Requirement          BizBot  KiddoBot  WaterBot
URLs validated       ✓       ✓         ✓
Content verified     ✓       ✓         ✓
Zero duplicates      ✓       ✓         ✓
Query coverage ≥90%  100% ✓  100% ✓    100% ✓

✅ READY FOR PRODUCTION

📖 Resources & Next Steps

Key Concepts to Learn

  • Vector Embeddings - How AI represents text as numbers
  • Cosine Similarity - How AI measures "closeness"
  • Chunking Strategies - How to split documents
  • Semantic Search - Finding meaning, not just keywords

Tools Used in This Project

  • PostgreSQL + pgvector - Vector database
  • OpenAI Embeddings - text-embedding-ada-002
  • Python Scripts - Validation automation
  • Playwright - URL verification

Want to Try This Yourself?

Start with a small knowledge base (10-20 documents). Run through all 5 phases. You'll learn more from finding and fixing real issues than from reading about them!

Questions? 🙋‍♀️