Retrieval-Augmented Generation (RAG) lets you point an AI at your own documents, SOPs, and knowledge base and get accurate, grounded answers with far fewer hallucinations. Here is the full technical walkthrough of how we built one that scores 91% accuracy on our internal evaluation set.
Thinkiyo Studio
December 28, 2025 · 9 min read
Retrieval-Augmented Generation (RAG) is one of the most practical AI patterns in enterprise software right now. The core idea: instead of relying on an LLM's training data (which is out of date and doesn't know anything about your business), you give it access to your own documents at query time. The LLM retrieves relevant passages, reads them, and answers — grounded in your actual knowledge.
Done well, RAG dramatically reduces hallucination on domain-specific questions and gives you a system that can answer questions about your SOPs, product documentation, pricing, client history, compliance policies, or anything else you can put in a document.
Done badly, it retrieves the wrong passages, answers with false confidence, and erodes trust quickly.
This is a detailed walkthrough of the system we built for a 60-person professional services firm that was drowning in repetitive internal questions. The goal: a Slack-accessible knowledge base that staff could query in plain English and get accurate answers from the company's internal documentation.
The system now handles 200+ queries per day and scores 91% on our evaluation harness.
The firm had extensive documentation: SOPs for every major process, a client onboarding handbook, a compliance manual, pricing and scope guidelines, an HR policy document, and years of project notes. In total, roughly 400 documents averaging 8 pages each.
The problem: nobody could find anything. Staff would ask the same questions in Slack over and over. Senior people were constantly interrupted with "where do I find the SLA template?" or "what's our policy on X?" Onboarding new staff took weeks because they couldn't navigate the documentation.
The goal was not to replace the documentation — it was to make it instantly accessible.
The system has four components: an ingestion pipeline that pulls and preprocesses documents from their source systems, a chunking and embedding step that indexes them into a vector database, a query pipeline that retrieves relevant chunks and generates a grounded answer, and an evaluation harness that measures accuracy on every change.
The firm's documents were scattered across three systems: Google Drive (the majority), Confluence (some technical docs), and a SharePoint folder (legacy documents). We needed to ingest all three sources.
We built three source connectors in Python, one per system; a simplified sketch of what a connector looks like follows.
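The per-source code is specific to each API, but the overall shape is roughly this (a minimal sketch; `SourceConnector`, `RawDocument`, and the method names are illustrative, not the production code):

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class RawDocument:
    doc_id: str
    title: str
    raw_text: str
    metadata: dict  # source, last_modified, owner, etc.


class SourceConnector(Protocol):
    """Common shape shared by the Google Drive, Confluence, and SharePoint connectors."""

    def list_documents(self) -> Iterator[str]: ...   # yields document IDs
    def fetch(self, doc_id: str) -> RawDocument: ...  # downloads and extracts text


def ingest(connector: SourceConnector) -> Iterator[RawDocument]:
    # Pull every document from one source; preprocessing happens downstream.
    for doc_id in connector.list_documents():
        yield connector.fetch(doc_id)
```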
Each document, once downloaded, goes through a preprocessing step:
def preprocess_document(raw_text: str, metadata: dict) -> dict:
    # remove_boilerplate, normalise_whitespace, and extract_metadata are helper
    # functions defined elsewhere in the ingestion code (not shown here).
    # Remove boilerplate (headers, footers, page numbers)
    cleaned = remove_boilerplate(raw_text)
    # Normalise whitespace
    cleaned = normalise_whitespace(cleaned)
    # Extract structured metadata where possible
    detected_metadata = extract_metadata(cleaned)
    return {
        "text": cleaned,
        "metadata": {**metadata, **detected_metadata},
        "word_count": len(cleaned.split()),
        "source": metadata["source"],
        "last_modified": metadata["last_modified"],
    }
We stored processed documents in a PostgreSQL table with full text and metadata. This serves as the source of truth and makes re-indexing cheap — we can re-embed without re-ingesting.
This metadata is crucial for filtering at query time — more on that below.
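As a rough sketch, the source-of-truth table looks something like this, assuming psycopg2 and a JSONB column for the metadata (the real schema isn't shown in this post):

```python
import json
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS documents (
    document_id   TEXT PRIMARY KEY,
    text          TEXT NOT NULL,
    metadata      JSONB NOT NULL,
    source        TEXT NOT NULL,
    last_modified DATE
);
"""

def store_document(conn, document_id: str, processed: dict) -> None:
    # `processed` is the dict returned by preprocess_document() above.
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(
            """
            INSERT INTO documents (document_id, text, metadata, source, last_modified)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (document_id) DO UPDATE SET
                text = EXCLUDED.text,
                metadata = EXCLUDED.metadata,
                source = EXCLUDED.source,
                last_modified = EXCLUDED.last_modified
            """,
            (document_id, processed["text"], json.dumps(processed["metadata"]),
             processed["source"], processed["last_modified"]),
        )
    conn.commit()
```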
Chunking is where most RAG implementations fail. If your chunks are too large, you retrieve too much irrelevant text and the LLM gets confused. If they are too small, you lose context and the retrieved passage doesn't make sense on its own.
Fixed-size chunking (500 tokens, 100 token overlap): Simple, fast, but breaks sentences mid-thought. Performance was noticeably worse than semantic chunking.
Paragraph-based chunking: Better than fixed-size, but document formatting was inconsistent across sources. Some "paragraphs" were 50 words; some were 800 words.
What we settled on: hierarchical chunking with overlap
For each document, we split on headings to get sections, then split each section into overlapping chunks, storing the document title and section heading alongside every chunk (a sketch of this step follows below).
The section heading stored with each chunk is critical — it provides context that helps the retrieval step and the LLM understand where in the document the chunk comes from.
For very short sections (under 100 tokens), we merge them with the following section to avoid creating chunks that are too small to be useful.
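A minimal sketch of the hierarchical chunking step. The chunk size, overlap, and word-count-as-token-count shortcut here are illustrative, not the tuned production values:

```python
CHUNK_TOKENS = 400        # illustrative target chunk size
OVERLAP_TOKENS = 50       # illustrative overlap between adjacent chunks
MIN_SECTION_TOKENS = 100  # sections shorter than this get merged forward


def token_count(text: str) -> int:
    return len(text.split())  # word count as a rough token proxy


def chunk_sections(doc_title: str, sections: list[tuple[str, str]]) -> list[dict]:
    """`sections` is a list of (heading, body) pairs already extracted from one document."""
    # Merge very short sections into the following one so no chunk is uselessly small.
    merged: list[tuple[str, str]] = []
    for heading, body in sections:
        if merged and token_count(merged[-1][1]) < MIN_SECTION_TOKENS:
            prev_heading, prev_body = merged.pop()
            body = prev_body + "\n\n" + heading + "\n" + body
            heading = prev_heading
        merged.append((heading, body))

    # Split each section into overlapping chunks, keeping the heading with every chunk.
    chunks = []
    for heading, body in merged:
        words = body.split()
        step = CHUNK_TOKENS - OVERLAP_TOKENS
        for start in range(0, max(len(words), 1), step):
            piece = " ".join(words[start:start + CHUNK_TOKENS])
            if piece:
                chunks.append({
                    "text": piece,
                    "document_title": doc_title,
                    "section_heading": heading,
                    "chunk_index": len(chunks),
                })
    return chunks
```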
Final stats: 400 documents → approximately 12,000 chunks.
We evaluated three embedding models:
| Model | Dimensions | Cost | Quality (on our eval set) |
|---|---|---|---|
| OpenAI text-embedding-ada-002 | 1536 | ~$0.10 per 1M tokens | Baseline |
| OpenAI text-embedding-3-small | 1536 | ~$0.02 per 1M tokens | +4% vs ada-002 |
| Cohere embed-english-v3 | 1024 | ~$0.10 per 1M tokens | +2% vs ada-002 |
We went with text-embedding-3-small — better quality than ada-002, much cheaper, and OpenAI's recommended default as of 2025.
Embedding all 12,000 chunks cost approximately $0.15 total. Re-embedding is cheap enough to do whenever we update the chunking strategy.
We used Pinecone as the vector database. We looked at a few alternatives, but Pinecone's managed service was the right call for this use case: we wanted to focus on the application logic, not database management.
The index configuration:
- Dimensions: 1536
- Metric: cosine
- Pod type: s1.x1 (starter, easily upgraded)
- Namespaces: one per document category (sop, policy, template, etc.)
Each chunk's vector carries metadata like this:

{
  "document_id": "abc123",
  "document_title": "Client Onboarding SOP v3.2",
  "section_heading": "Section 4: Account Setup",
  "chunk_index": 7,
  "source": "google_drive",
  "category": "sop",
  "department": "client_success",
  "last_modified": "2025-11-01"
}
When a staff member sends a question (via Slack slash command), the query pipeline embeds the question, uses metadata filters to route it to the right namespace, runs hybrid (vector + BM25) retrieval, re-ranks the candidates, and assembles the top chunks into a prompt along these lines:
You are an internal knowledge assistant for [Company]. Answer the following question based ONLY on the provided context. If the answer is not in the context, say "I don't have this information in our knowledge base" — do not make up an answer.
Context:
[CHUNK 1 — Document: Client Onboarding SOP, Section: Account Setup]
[text...]
[CHUNK 2 — Document: Service Agreement Template, Section: Payment Terms]
[text...]
[CHUNK 3 — ...]
[CHUNK 4 — ...]
Question: [user's question]
Answer:
The instruction to say "I don't have this information" rather than guess is critical. Without it, the LLM will hallucinate answers from its training data when the retrieved context doesn't contain the answer.
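As a rough sketch of the generation step, assuming OpenAI's chat completions API (the model name is illustrative, since the post doesn't name one; retrieval and re-ranking happen upstream and hand this function the final chunks):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an internal knowledge assistant for [Company]. Answer the following "
    "question based ONLY on the provided context. If the answer is not in the "
    "context, say \"I don't have this information in our knowledge base\" - do not "
    "make up an answer."
)


def generate_answer(question: str, top_chunks: list[dict]) -> str:
    # top_chunks are the re-ranked retrieval results: dicts carrying the chunk
    # text plus the document_title and section_heading metadata shown earlier.
    context = "\n\n".join(
        f"[CHUNK {i + 1} - Document: {c['document_title']}, "
        f"Section: {c['section_heading']}]\n{c['text']}"
        for i, c in enumerate(top_chunks)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the post doesn't specify the model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"},
        ],
    )
    return response.choices[0].message.content
```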
Evaluation is the part most teams skip, and it is the reason most RAG implementations underperform.
We built a test set of 200 question-answer pairs.
We run the evaluation harness on every change to the pipeline (chunking strategy, prompt, re-ranking model), checking the system's answer to each of the 200 questions against the expected answer.
Starting accuracy: 74%. After iterating on chunking strategy, re-ranking, and prompt: 91%.
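A minimal sketch of what an evaluation run can look like. The post doesn't say how answers are scored, so the LLM-as-grader step and the `eval_set.jsonl` filename here are assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = (
    "Question: {question}\nExpected answer: {expected}\nSystem answer: {actual}\n\n"
    "Does the system answer convey the same facts as the expected answer? "
    "Reply with exactly 'correct' or 'incorrect'."
)


def run_eval(answer_fn, test_set_path: str = "eval_set.jsonl") -> float:
    # answer_fn(question, category) is the end-to-end query pipeline.
    with open(test_set_path) as f:
        cases = [json.loads(line) for line in f]
    correct = 0
    for case in cases:
        actual = answer_fn(case["question"], case["category"])
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative grader model
            messages=[{"role": "user", "content": GRADER_PROMPT.format(
                question=case["question"], expected=case["answer"], actual=actual)}],
        ).choices[0].message.content.lower()
        correct += "correct" in verdict and "incorrect" not in verdict
    return correct / len(cases)
```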
The biggest single improvement came from adding the re-ranking step — it improved accuracy by 8 percentage points. The second biggest improvement came from metadata filtering — routing the query to the correct namespace rather than searching all documents.
Start with the evaluation harness. We built it after the first version of the system was live. Building it first would have saved three weeks of iteration time.
Invest in document quality. The documents with the most retrieval failures were the ones that were poorly structured — no headings, dense walls of text, outdated content mixed with current content. Cleaning up the worst 20 documents improved accuracy more than any technical change.
Hybrid search from day one. We initially used pure vector search. Adding a BM25 keyword search component (hybrid search) and combining scores improved accuracy on specific searches (like product names, version numbers, and proper nouns) significantly.
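The exact score-combination method isn't described in the post; a weighted sum over normalised scores is one common approach and looks roughly like this (the 0.7 weight is an illustrative starting point):

```python
def hybrid_scores(vector_hits: dict[str, float], bm25_hits: dict[str, float],
                  alpha: float = 0.7) -> dict[str, float]:
    """Combine per-chunk scores from vector search and BM25 keyword search."""

    def normalise(scores: dict[str, float]) -> dict[str, float]:
        # Min-max normalise so the two score distributions are comparable.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    v, b = normalise(vector_hits), normalise(bm25_hits)
    return {cid: alpha * v.get(cid, 0.0) + (1 - alpha) * b.get(cid, 0.0)
            for cid in set(v) | set(b)}
```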
The system now handles over 200 queries per day across the 60-person firm. Staff adoption took about two weeks to reach steady state — once people experienced getting a correct answer with a source citation in under 3 seconds, they stopped emailing each other for document help.
Estimated time saved: 15–20 hours per week across the firm, based on before/after surveys.
The 9% failure rate is handled by graceful degradation: the system says "I don't have this information" and suggests who to ask. Staff have learned to trust the "I don't know" as much as the positive answers.