A Real-World Case Study in AI Data Quality
Based on the CA-AIDev Bot Quality Audit Project
For students new to AI development
Instead of relying only on what an AI model "memorized" during training, Retrieval-Augmented Generation (RAG) lets the AI look up information from a knowledge base before answering.
- Without RAG: the AI relies on its training data, which can be outdated or wrong
- With RAG: the AI searches your knowledge base for current, accurate information
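To make the retrieval step concrete, here is a minimal sketch of the flow in Python. The table name (`chunks`), database name, and model choices are illustrative assumptions, not the project's actual code:

```python
# Minimal RAG retrieval sketch (table/database/model names are illustrative)
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str) -> str:
    # 1. Embed the question
    emb = client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding
    vec = "[" + ",".join(map(str, emb)) + "]"
    # 2. Retrieve the closest chunks from the knowledge base (cosine distance)
    conn = psycopg2.connect("dbname=bots")
    cur = conn.cursor()
    cur.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
        (vec,),
    )
    context = "\n\n".join(row[0] for row in cur.fetchall())
    # 3. Answer from the retrieved context instead of memory alone
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Answer from this context:\n{context}\n\nQ: {question}"}],
    )
    return chat.choices[0].message.content
```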
The project covered three bots:

- BizBot: California business licensing information
- KiddoBot: child care and family resources
- WaterBot: water board regulations
Core Value: Every piece of information a bot returns must be accurate and verifiable.
We developed a systematic approach to ensure data quality:
| Phase | Focus | Question Answered |
|---|---|---|
| 1 | URL Validation | Are all links still working? |
| 2 | Content Quality | Is the information still accurate? |
| 3 | Deduplication | Any duplicate or corrupt data? |
| 4 | Chunk Consistency | Are chunks the right size? |
| 5 | Query Testing | Does search return relevant results? |
Government websites change constantly. A link that worked last year might be broken today.
California's Franchise Tax Board restructured their entire website:
OLD: ftb.ca.gov/forms/search/index.aspx?form=XXX
NEW: ftb.ca.gov/forms/search/
OLD: ftb.ca.gov/pay/business/web-pay.html
NEW: ftb.ca.gov/pay/bank-account/index.asp
We found 116 broken URLs across all three bots (15% of total)!
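A link checker for this kind of audit can be quite small. The sketch below is illustrative (the HEAD-then-GET fallback and timeouts are assumptions, not the project's actual scanner):

```python
# Minimal broken-link scanner (illustrative sketch)
import requests

def check_urls(urls: list[str]) -> list[tuple[str, str]]:
    broken = []
    for url in urls:
        try:
            # HEAD is cheap; some servers reject it, so fall back to GET
            resp = requests.head(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                resp = requests.get(url, allow_redirects=True, timeout=10)
            if resp.status_code >= 400:
                broken.append((url, f"HTTP {resp.status_code}"))
        except requests.RequestException as exc:
            broken.append((url, type(exc).__name__))
    return broken
```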
We built automated scanners to flag potentially outdated content:
Finding: Manufacturing guide referenced ISO 13485:2003
Problem: The 2003 version was withdrawn in 2016!
Fix: Updated to ISO 13485:2016
Without this audit, users would have been pointed to an obsolete standard.
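A scanner like this can be as simple as a set of regular expressions over chunk text. A minimal sketch (the two patterns shown are examples, not the project's full rule set):

```python
# Flag chunks that cite possibly superseded standards or stale years (illustrative)
import re

STALE_PATTERNS = [
    (re.compile(r"ISO\s*13485:2003"), "ISO 13485:2003 was withdrawn; current is 2016"),
    (re.compile(r"\b(19\d{2}|200\d)\b"), "pre-2010 year; verify content is still current"),
]

def scan_chunk(text: str) -> list[str]:
    """Return a note for every stale-content pattern found in this chunk."""
    return [note for pattern, note in STALE_PATTERNS if pattern.search(text)]
```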
| Check | BizBot | KiddoBot | WaterBot |
|---|---|---|---|
| Duplicate content | 0 ✅ | 0 ✅ | 0 ✅ |
| NULL embeddings | 0 ✅ | 0 ✅ | 0 ✅ |
| Wrong dimensions | 0 ✅ | 0 ✅ | 0 ✅ |
Expected dimensions: 1536 (OpenAI's text-embedding-ada-002 model)
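These checks translate directly into SQL. A minimal sketch, assuming a hypothetical `chunks` table with an `embedding` pgvector column (names are illustrative):

```python
# Embedding integrity checks against pgvector (illustrative table/column names)
import psycopg2

conn = psycopg2.connect("dbname=bots")
cur = conn.cursor()

# Rows whose embedding was never written
cur.execute("SELECT count(*) FROM chunks WHERE embedding IS NULL")
null_count = cur.fetchone()[0]

# Rows whose embedding has the wrong dimensionality for ada-002
# (only possible if the column isn't declared as vector(1536))
cur.execute("SELECT count(*) FROM chunks WHERE vector_dims(embedding) <> 1536")
bad_dims = cur.fetchone()[0]

print(f"NULL embeddings: {null_count}, wrong dimensions: {bad_dims}")
```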
| Bot | Total Chunks | Undersized | Ideal Range % |
|---|---|---|---|
| BizBot | 425 | 33 (7.8%) | 92.2% |
| KiddoBot | 1,390 | 14 (1%) | 98.9% |
| WaterBot | 1,253 | 1 (0.08%) | 99.9% ⭐ |
WaterBot = Gold Standard! 99.9% ideal chunk sizes, 100% metadata coverage. This became our reference for best practices.
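Measuring chunk-size distribution is a single query. A sketch of the kind of check behind the table above, assuming the same illustrative `chunks` table and the 100-character minimum recommended later in this write-up:

```python
# Chunk-size distribution check (illustrative names and threshold)
import psycopg2

MIN_CHUNK_SIZE = 100  # characters; per the recommendation in this audit

conn = psycopg2.connect("dbname=bots")
cur = conn.cursor()
cur.execute(
    """
    SELECT count(*) FILTER (WHERE length(content) < %s) AS undersized,
           count(*) AS total
    FROM chunks
    """,
    (MIN_CHUNK_SIZE,),
)
undersized, total = cur.fetchone()
print(f"{undersized}/{total} chunks under {MIN_CHUNK_SIZE} chars")
```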
Those 33 undersized BizBot chunks? They became the star of Phase 5...
Do real user queries actually return relevant results? But there's a catch...
BizBot "exceeded" its 500+ chunk target... but 33% of content was duplicate. We hit the target by counting the same content multiple times!
| Bot | Reported | Actually Unique | Duplicates |
|---|---|---|---|
| BizBot | 637 | 425 | 212 (33%) |
| KiddoBot | 1,482 | 1,402 | 80 (5%) |
| WaterBot | 1,489 | 1,401 | 88 (6%) |
Root cause: Arbitrary targets ("500+ chunks") incentivized quantity over quality. The ingestion scripts had no deduplication check.
"FTB business registration requirements"
"Is hard water bad for my health?"
The fix keeps one row per unique content (the lowest id) and deletes the rest:

```sql
-- Keep the lowest id from each duplicate group, delete the rest
-- (table name "chunks" is illustrative)
DELETE FROM chunks
WHERE id NOT IN (
    SELECT MIN(id)
    FROM chunks
    GROUP BY md5(content)
);
```
Result: 380 duplicate rows removed
Created 25 consumer FAQ docs for WaterBot:
New documents existed in the database but returned 0 results in similarity search. The culprit: pgvector's IVFFlat index builds its cluster centroids from the rows present when the index is created, so rows inserted in bulk afterwards can land in ill-fitting clusters and be missed at query time until the index is rebuilt.
```sql
REINDEX INDEX schema.embedding_idx;  -- REQUIRED after bulk inserts!
```
After fixes: All three bots at 100% adversarial query coverage.
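This phase is also the easiest to automate. A sketch of a retrieval smoke test, assuming the illustrative `chunks` table and the OpenAI Python client (the distance threshold of 0.25 is a made-up example, not the project's actual cutoff):

```python
# Retrieval smoke test: every test query should hit at least one close chunk
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
conn = psycopg2.connect("dbname=bots")
cur = conn.cursor()

TEST_QUERIES = [
    "FTB business registration requirements",
    "Is hard water bad for my health?",
]

for query in TEST_QUERIES:
    emb = client.embeddings.create(
        model="text-embedding-ada-002", input=query
    ).data[0].embedding
    vec = "[" + ",".join(map(str, emb)) + "]"
    # Cosine distance; smaller is closer
    cur.execute(
        "SELECT content, embedding <=> %s::vector AS dist "
        "FROM chunks ORDER BY dist LIMIT 3",
        (vec,),
    )
    rows = cur.fetchall()
    best = rows[0][1] if rows else float("inf")
    # 0.25 is an illustrative threshold; tune against labeled queries
    print(f"{'OK  ' if best < 0.25 else 'MISS'} {query!r} (best distance {best:.3f})")
```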
- Arbitrary chunk count targets (500+, 1,400+) incentivize gaming the metric. Measure query success rate, not row counts.
- Not all "old dates" are errors. Historical content (drought timelines, past regulations) is intentionally historical. Build smart filters.
- Users say "Stage 1 vs Stage 2," but content says "CalWORKs Child Care Stages." Consider synonyms and query suggestions.
- Unit tests on data aren't enough. You must test the actual retrieval flow with realistic user questions.
- Enforce a MIN_CHUNK_SIZE at ingestion (100+ chars recommended).

The final production-readiness checklist:

| Requirement | BizBot | KiddoBot | WaterBot |
|---|---|---|---|
| URLs validated | ✅ | ✅ | ✅ |
| Content verified | ✅ | ✅ | ✅ |
| Zero duplicates | ✅ | ✅ | ✅ |
| Query coverage ≥90% | 100% ✅ | 100% ✅ | 100% ✅ |
✅ READY FOR PRODUCTION
Start with a small knowledge base (10-20 documents). Run through all 5 phases. You'll learn more from finding and fixing real issues than from reading about them!
Questions? 🙋‍♀️