Retrieval-Augmented Generation (RAG) lets you point an AI at your own documents, SOPs, and knowledge base and get accurate, grounded answers with far fewer hallucinations. Here is the full technical walkthrough of how we built one that scores 91% accuracy on our internal evaluation set.
Thinkiyo Studio
December 28, 2025 · 9 min read
Retrieval-Augmented Generation (RAG) is one of the most practical AI patterns in enterprise software right now. The core idea: instead of relying on an LLM's training data (which is out of date and doesn't know anything about your business), you give it access to your own documents at query time. The LLM retrieves relevant passages, reads them, and answers — grounded in your actual knowledge.
Done well, RAG dramatically reduces hallucination on domain-specific questions and gives you a system that can answer questions about your SOPs, product documentation, pricing, client history, compliance policies, or anything else you can put in a document.
Done badly, it retrieves the wrong passages, answers with false confidence, and erodes trust quickly.
This is a detailed walkthrough of the system we built for a 60-person professional services firm that was drowning in repetitive internal questions. The goal: a Slack-accessible knowledge base that staff could query in plain English and get accurate answers from the company's internal documentation.
The system now handles 200+ queries per day and scores 91% on our evaluation harness.
The firm had extensive documentation: SOPs for every major process, a client onboarding handbook, a compliance manual, pricing and scope guidelines, an HR policy document, and years of project notes. In total, roughly 400 documents averaging 8 pages each.
The problem: nobody could find anything. Staff would ask the same questions in Slack over and over. Senior people were constantly interrupted with "where do I find the SLA template?" or "what's our policy on X?" Onboarding new staff took weeks because they couldn't navigate the documentation.
The goal was not to replace the documentation — it was to make it instantly accessible.
The system has four components: an ingestion pipeline that pulls and preprocesses documents from their source systems, a chunking and embedding step that indexes them into a vector database, a query pipeline that retrieves relevant chunks and generates a grounded answer, and an evaluation harness that measures accuracy on every change.
The firm's documents were scattered across three systems: Google Drive (the majority), Confluence (some technical docs), and a SharePoint folder (legacy documents). We needed to ingest all three sources.
We built three source connectors in Python, one per system; a simplified sketch of what a connector looks like follows.
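The per-source code is specific to each API, but the overall shape is roughly this (a minimal sketch; `SourceConnector`, `RawDocument`, and the method names are illustrative, not the production code):

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class RawDocument:
    doc_id: str
    title: str
    raw_text: str
    metadata: dict  # source, last_modified, owner, etc.


class SourceConnector(Protocol):
    """Common shape shared by the Google Drive, Confluence, and SharePoint connectors."""

    def list_documents(self) -> Iterator[str]: ...   # yields document IDs
    def fetch(self, doc_id: str) -> RawDocument: ...  # downloads and extracts text


def ingest(connector: SourceConnector) -> Iterator[RawDocument]:
    # Pull every document from one source; preprocessing happens downstream.
    for doc_id in connector.list_documents():
        yield connector.fetch(doc_id)
```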
Each document, once downloaded, goes through a preprocessing step:
def preprocess_document(raw_text: str, metadata: dict) -> dict:
    # remove_boilerplate, normalise_whitespace, and extract_metadata are helper
    # functions defined elsewhere in the ingestion code (not shown here).
    # Remove boilerplate (headers, footers, page numbers)
    cleaned = remove_boilerplate(raw_text)
    # Normalise whitespace
    cleaned = normalise_whitespace(cleaned)
    # Extract structured metadata where possible
    detected_metadata = extract_metadata(cleaned)
    return {
        "text": cleaned,
        "metadata": {**metadata, **detected_metadata},
        "word_count": len(cleaned.split()),
        "source": metadata["source"],
        "last_modified": metadata["last_modified"],
    }
We stored processed documents in a PostgreSQL table with full text and metadata. This serves as the source of truth and makes re-indexing cheap — we can re-embed without re-ingesting.
This metadata is crucial for filtering at query time — more on that below.
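As a rough sketch, the source-of-truth table looks something like this, assuming psycopg2 and a JSONB column for the metadata (the real schema isn't shown in this post):

```python
import json
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS documents (
    document_id   TEXT PRIMARY KEY,
    text          TEXT NOT NULL,
    metadata      JSONB NOT NULL,
    source        TEXT NOT NULL,
    last_modified DATE
);
"""

def store_document(conn, document_id: str, processed: dict) -> None:
    # `processed` is the dict returned by preprocess_document() above.
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(
            """
            INSERT INTO documents (document_id, text, metadata, source, last_modified)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (document_id) DO UPDATE SET
                text = EXCLUDED.text,
                metadata = EXCLUDED.metadata,
                source = EXCLUDED.source,
                last_modified = EXCLUDED.last_modified
            """,
            (document_id, processed["text"], json.dumps(processed["metadata"]),
             processed["source"], processed["last_modified"]),
        )
    conn.commit()
```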
Chunking is where most RAG implementations fail. If your chunks are too large, you retrieve too much irrelevant text and the LLM gets confused. If they are too small, you lose context and the retrieved passage doesn't make sense on its own.
Fixed-size chunking (500 tokens, 100 token overlap): Simple, fast, but breaks sentences mid-thought. Performance was noticeably worse than semantic chunking.
Paragraph-based chunking: Better than fixed-size, but document formatting was inconsistent across sources. Some "paragraphs" were 50 words; some were 800 words.
What we settled on: hierarchical chunking with overlap
For each document, we split on headings to get sections, then split each section into overlapping chunks, storing the document title and section heading alongside every chunk (a sketch of this step follows below).
The section heading stored with each chunk is critical — it provides context that helps the retrieval step and the LLM understand where in the document the chunk comes from.
For very short sections (under 100 tokens), we merge them with the following section to avoid creating chunks that are too small to be useful.
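A minimal sketch of the hierarchical chunking step. The chunk size, overlap, and word-count-as-token-count shortcut here are illustrative, not the tuned production values:

```python
CHUNK_TOKENS = 400        # illustrative target chunk size
OVERLAP_TOKENS = 50       # illustrative overlap between adjacent chunks
MIN_SECTION_TOKENS = 100  # sections shorter than this get merged forward


def token_count(text: str) -> int:
    return len(text.split())  # word count as a rough token proxy


def chunk_sections(doc_title: str, sections: list[tuple[str, str]]) -> list[dict]:
    """`sections` is a list of (heading, body) pairs already extracted from one document."""
    # Merge very short sections into the following one so no chunk is uselessly small.
    merged: list[tuple[str, str]] = []
    for heading, body in sections:
        if merged and token_count(merged[-1][1]) < MIN_SECTION_TOKENS:
            prev_heading, prev_body = merged.pop()
            body = prev_body + "\n\n" + heading + "\n" + body
            heading = prev_heading
        merged.append((heading, body))

    # Split each section into overlapping chunks, keeping the heading with every chunk.
    chunks = []
    for heading, body in merged:
        words = body.split()
        step = CHUNK_TOKENS - OVERLAP_TOKENS
        for start in range(0, max(len(words), 1), step):
            piece = " ".join(words[start:start + CHUNK_TOKENS])
            if piece:
                chunks.append({
                    "text": piece,
                    "document_title": doc_title,
                    "section_heading": heading,
                    "chunk_index": len(chunks),
                })
    return chunks
```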
Final stats: 400 documents → approximately 12,000 chunks.
We evaluated three embedding models:
| Model | Dimensions | Cost | Quality (on our eval set) |
|---|---|---|---|
| OpenAI text-embedding-ada-002 | 1536 | ~$0.10 per 1M tokens | Baseline |
| OpenAI text-embedding-3-small | 1536 | ~$0.02 per 1M tokens | +4% vs ada-002 |
| Cohere embed-english-v3 | 1024 | ~$0.10 per 1M tokens | +2% vs ada-002 |
We went with text-embedding-3-small — better quality than ada-002, much cheaper, and OpenAI's recommended default as of 2025.
Embedding all 12,000 chunks cost approximately $0.15 total. Re-embedding is cheap enough to do whenever we update the chunking strategy.
We used Pinecone as the vector database. We looked at a few alternatives, but Pinecone's managed service was the right call for this use case: we wanted to focus on the application logic, not database management.
The index configuration:
- Dimensions: 1536
- Metric: cosine
- Pod type: s1.x1 (starter, easily upgraded)
- Namespaces: one per document category (sop, policy, template, etc.)
Each chunk's vector carries metadata like this:

{
  "document_id": "abc123",
  "document_title": "Client Onboarding SOP v3.2",
  "section_heading": "Section 4: Account Setup",
  "chunk_index": 7,
  "source": "google_drive",
  "category": "sop",
  "department": "client_success",
  "last_modified": "2025-11-01"
}
When a staff member sends a question (via Slack slash command), the query pipeline embeds the question, uses metadata filters to route it to the right namespace, runs hybrid (vector + BM25) retrieval, re-ranks the candidates, and assembles the top chunks into a prompt along these lines:
You are an internal knowledge assistant for [Company]. Answer the following question based ONLY on the provided context. If the answer is not in the context, say "I don't have this information in our knowledge base" — do not make up an answer.
Context:
[CHUNK 1 — Document: Client Onboarding SOP, Section: Account Setup]
[text...]
[CHUNK 2 — Document: Service Agreement Template, Section: Payment Terms]
[text...]
[CHUNK 3 — ...]
[CHUNK 4 — ...]
Question: [user's question]
Answer:
The instruction to say "I don't have this information" rather than guess is critical. Without it, the LLM will hallucinate answers from its training data when the retrieved context doesn't contain the answer.
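As a rough sketch of the generation step, assuming OpenAI's chat completions API (the model name is illustrative, since the post doesn't name one; retrieval and re-ranking happen upstream and hand this function the final chunks):

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an internal knowledge assistant for [Company]. Answer the following "
    "question based ONLY on the provided context. If the answer is not in the "
    "context, say \"I don't have this information in our knowledge base\" - do not "
    "make up an answer."
)


def generate_answer(question: str, top_chunks: list[dict]) -> str:
    # top_chunks are the re-ranked retrieval results: dicts carrying the chunk
    # text plus the document_title and section_heading metadata shown earlier.
    context = "\n\n".join(
        f"[CHUNK {i + 1} - Document: {c['document_title']}, "
        f"Section: {c['section_heading']}]\n{c['text']}"
        for i, c in enumerate(top_chunks)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the post doesn't specify the model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"},
        ],
    )
    return response.choices[0].message.content
```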
Evaluation is the part most teams skip, and it is the reason most RAG implementations underperform.
We built a test set of 200 question-answer pairs.
We run the evaluation harness on every change to the pipeline (chunking strategy, prompt, re-ranking model), checking the system's answer to each of the 200 questions against the expected answer.
Starting accuracy: 74%. After iterating on chunking strategy, re-ranking, and prompt: 91%.
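A minimal sketch of what an evaluation run can look like. The post doesn't say how answers are scored, so the LLM-as-grader step and the `eval_set.jsonl` filename here are assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = (
    "Question: {question}\nExpected answer: {expected}\nSystem answer: {actual}\n\n"
    "Does the system answer convey the same facts as the expected answer? "
    "Reply with exactly 'correct' or 'incorrect'."
)


def run_eval(answer_fn, test_set_path: str = "eval_set.jsonl") -> float:
    # answer_fn(question, category) is the end-to-end query pipeline.
    with open(test_set_path) as f:
        cases = [json.loads(line) for line in f]
    correct = 0
    for case in cases:
        actual = answer_fn(case["question"], case["category"])
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative grader model
            messages=[{"role": "user", "content": GRADER_PROMPT.format(
                question=case["question"], expected=case["answer"], actual=actual)}],
        ).choices[0].message.content.lower()
        correct += "correct" in verdict and "incorrect" not in verdict
    return correct / len(cases)
```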
The biggest single improvement came from adding the re-ranking step — it improved accuracy by 8 percentage points. The second biggest improvement came from metadata filtering — routing the query to the correct namespace rather than searching all documents.
Start with the evaluation harness. We built it after the first version of the system was live. Building it first would have saved three weeks of iteration time.
Invest in document quality. The documents with the most retrieval failures were the ones that were poorly structured — no headings, dense walls of text, outdated content mixed with current content. Cleaning up the worst 20 documents improved accuracy more than any technical change.
Hybrid search from day one. We initially used pure vector search. Adding a BM25 keyword search component (hybrid search) and combining scores improved accuracy on specific searches (like product names, version numbers, and proper nouns) significantly.
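The exact score-combination method isn't described in the post; a weighted sum over normalised scores is one common approach and looks roughly like this (the 0.7 weight is an illustrative starting point):

```python
def hybrid_scores(vector_hits: dict[str, float], bm25_hits: dict[str, float],
                  alpha: float = 0.7) -> dict[str, float]:
    """Combine per-chunk scores from vector search and BM25 keyword search."""

    def normalise(scores: dict[str, float]) -> dict[str, float]:
        # Min-max normalise so the two score distributions are comparable.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    v, b = normalise(vector_hits), normalise(bm25_hits)
    return {cid: alpha * v.get(cid, 0.0) + (1 - alpha) * b.get(cid, 0.0)
            for cid in set(v) | set(b)}
```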
The system now handles over 200 queries per day across the 60-person firm. Staff adoption took about two weeks to reach steady state — once people experienced getting a correct answer with a source citation in under 3 seconds, they stopped emailing each other for document help.
Estimated time saved: 15–20 hours per week across the firm, based on before/after surveys.
The 9% failure rate is handled by graceful degradation: the system says "I don't have this information" and suggests who to ask. Staff have learned to trust the "I don't know" as much as the positive answers.