RAG (Retrieval-Augmented Generation) combines the power of semantic search with LLMs to answer questions grounded in your own data. Instead of relying solely on the model’s training data, RAG retrieves relevant documents first, then uses them as context for generation. This post walks through building a complete RAG pipeline using pgvector for vector storage, Ollama for local embeddings, and Claude for answer generation.
## Architecture Overview
```text
Query: "What is RAG?"
          │
          ▼
┌─────────────────────┐
│ Ollama              │  ← Embedding (text → vector)
│ nomic-embed-text    │
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│ pgvector Search     │  ← Find similar documents
│ (cosine distance)   │
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│ Claude Haiku        │  ← Generate answer with context
│ (LangChain)         │
└─────────────────────┘
          │
          ▼
       Answer
```
## Project Setup

Initialize the project with uv:

```bash
uv init learn-rag
cd learn-rag
uv add langchain langchain-anthropic langchain-core pgvector psycopg2-binary python-dotenv requests tenacity
```

Create a `.env` file:

```bash
DATABASE_URL=postgresql://postgres:postgres@localhost:5432/learn_rag
ANTHROPIC_API_KEY=sk-ant-...
```
## Setting Up PostgreSQL with pgvector

We use Docker to run PostgreSQL with the pgvector extension:
```yaml
# docker-compose.yml
services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: learn_rag
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
```

Start the database:

```bash
docker-compose up -d
```
Create the documents table with vector support:
```python
# connection.py
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

# Connect and create the cursor before the try block, so the
# finally clause never references names that were never bound
conn = psycopg2.connect(os.getenv("DATABASE_URL"))
cur = conn.cursor()
try:
    # Enable the pgvector extension
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")

    # Create the documents table with 768-dimensional vectors
    cur.execute("DROP TABLE IF EXISTS documents CASCADE;")
    cur.execute("""
        CREATE TABLE documents (
            id SERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            metadata JSONB,
            embedding VECTOR(768),
            created_at TIMESTAMPTZ DEFAULT NOW()
        );
    """)
    conn.commit()
    print("Table 'documents' created successfully.")
except Exception as error:
    print("Error:", error)
finally:
    cur.close()
    conn.close()
```
The `VECTOR(768)` type stores 768-dimensional vectors — the output size of nomic-embed-text.
## Installing Ollama and Pulling the Embedding Model

Ollama runs LLMs and embedding models locally. Install it (on macOS, via Homebrew):

```bash
brew install ollama
brew services start ollama
```

Pull the embedding model:

```bash
ollama pull nomic-embed-text
```
`nomic-embed-text` is a 137M-parameter model that produces 768-dimensional embeddings. It's fast, runs on CPU, and delivers solid retrieval quality.
## Understanding Embeddings
Embeddings convert text into dense vectors where semantically similar texts have similar vectors:
```text
"Python is a programming language"  →  [0.023, -0.156, 0.089, ...]
"coding language"                   →  [0.019, -0.148, 0.092, ...]
"Docker containers"                 →  [-0.234, 0.067, -0.123, ...]
```
The first two vectors are close (similar meaning), while the third is far away (different topic). This enables semantic search — finding documents by meaning, not just keywords.
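"Close" here is usually measured with cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch, reusing the illustrative three-component vectors above (real embeddings have 768 components):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

python_vec = [0.023, -0.156, 0.089]
coding_vec = [0.019, -0.148, 0.092]
docker_vec = [-0.234, 0.067, -0.123]

print(cosine_similarity(python_vec, coding_vec))  # high: similar meaning
print(cosine_similarity(python_vec, docker_vec))  # low: different topic
```

pgvector's `<=>` operator computes the complementary cosine *distance* (1 − similarity), which is why the search query later converts back with `1 - (embedding <=> ...)`.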
### Why Different Models Have Different Dimensions

| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Good | Very fast |
| nomic-embed-text | 768 | Better | Fast |
| mxbai-embed-large | 1024 | Best | Medium |
Higher dimensions don’t always mean better quality. Model architecture and training data matter more. Choose based on your latency and quality requirements.
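Dimensions also drive storage cost: pgvector stores each component as a 4-byte single-precision float, so per-row vector size scales linearly. A back-of-the-envelope sketch (raw component bytes only, ignoring per-row header overhead):

```python
BYTES_PER_FLOAT = 4  # pgvector's vector type stores float4 components

def vector_bytes(dims: int, rows: int = 1) -> int:
    """Raw storage for `rows` vectors of `dims` float4 components."""
    return dims * BYTES_PER_FLOAT * rows

for dims in (384, 768, 1024):
    # 768 dims ≈ 3 GB of raw vector data per million rows
    print(dims, vector_bytes(dims, rows=1_000_000))
```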
## Generating and Storing Embeddings
```python
# search.py
import os

import psycopg2
import requests
from dotenv import load_dotenv
from pgvector.psycopg2 import register_vector
from tenacity import retry, stop_after_attempt, wait_exponential

load_dotenv()

OLLAMA_URL = "http://localhost:11434/api/embeddings"
EMBEDDING_MODEL = "nomic-embed-text"
TOP_K = 3


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def get_embedding(text: str) -> list[float]:
    """Get an embedding from the Ollama API, retrying on transient failures."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": EMBEDDING_MODEL, "prompt": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["embedding"]


def search_similar_documents(query: str, top_k: int = TOP_K) -> list[dict]:
    """Search for similar documents using vector similarity."""
    query_vector = get_embedding(query)
    conn = psycopg2.connect(os.getenv("DATABASE_URL"))
    register_vector(conn)
    cur = conn.cursor()
    try:
        cur.execute(
            """
            SELECT id, content, metadata, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s;
            """,
            (query_vector, query_vector, top_k),
        )
        return [
            {"id": row[0], "content": row[1], "metadata": row[2], "similarity": row[3]}
            for row in cur.fetchall()
        ]
    finally:
        cur.close()
        conn.close()
```
Key points:

- `register_vector(conn)`: registers the pgvector type with psycopg2
- `<=>` operator: computes cosine distance (smaller = more similar)
- `1 - distance`: converts distance to a similarity score (higher = more similar)
- `@retry`: handles transient network failures with exponential backoff
## Inserting Documents
```python
# insert_embeddings.py
import json
import os

import psycopg2
from dotenv import load_dotenv
from pgvector.psycopg2 import register_vector

from search import get_embedding

load_dotenv()

SAMPLE_DOCUMENTS = [
    {
        "content": "Python is a high-level programming language known for its simplicity.",
        "metadata": {"source": "wiki", "topic": "programming"},
    },
    {
        "content": "PostgreSQL is a powerful open-source relational database.",
        "metadata": {"source": "wiki", "topic": "database"},
    },
    {
        "content": "RAG combines retrieval with LLMs for grounded generation.",
        "metadata": {"source": "paper", "topic": "ai"},
    },
]


def main():
    # Generate an embedding for every document up front
    vectors = [get_embedding(doc["content"]) for doc in SAMPLE_DOCUMENTS]

    conn = psycopg2.connect(os.getenv("DATABASE_URL"))
    register_vector(conn)
    cur = conn.cursor()
    try:
        for doc, vector in zip(SAMPLE_DOCUMENTS, vectors):
            cur.execute(
                "INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s)",
                (doc["content"], json.dumps(doc["metadata"]), vector),
            )
        conn.commit()
        print(f"Inserted {len(SAMPLE_DOCUMENTS)} documents.")
    finally:
        cur.close()
        conn.close()


if __name__ == "__main__":
    main()
```
## Building the RAG Pipeline

Now we connect everything — retrieval + generation:
```python
# main.py
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

from search import search_similar_documents

load_dotenv()

LLM_MODEL = "claude-haiku-4-5-20251001"
TOP_K = 3


def rag_query(question: str, top_k: int = TOP_K) -> str:
    """Complete RAG pipeline: retrieve, build context, generate."""
    # Step 1: Retrieve relevant documents
    documents = search_similar_documents(question, top_k)

    # Step 2: Build context from the documents
    context = "\n\n".join(
        f"Document {i + 1} (relevance: {doc['similarity']:.2f}):\n{doc['content']}"
        for i, doc in enumerate(documents)
    )

    # Step 3: Generate an answer using the LLM
    llm = ChatAnthropic(model_name=LLM_MODEL, timeout=30)
    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful assistant that answers questions based on the provided context.
Use ONLY the information from the context to answer. If the context doesn't contain enough information, say so.
Be concise and direct in your answers."""),
        ("human", """Context:
{context}

Question: {question}

Answer:"""),
    ])
    chain = prompt | llm
    response = chain.invoke({"context": context, "question": question})
    return response.content
```
The system prompt constrains the LLM to only use retrieved context, reducing hallucinations.
## Interactive CLI

Wrap it in an interactive loop:
```python
def main():
    print("\n╔══════════════════════════════════════════════════╗")
    print("║               RAG Query Assistant                ║")
    print("║          Type 'quit' or 'exit' to stop           ║")
    print("╚══════════════════════════════════════════════════╝\n")
    while True:
        try:
            user_input = input("You > ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\n\nGoodbye!")
            break
        if not user_input:
            continue
        if user_input.lower() in ("quit", "exit"):
            print("\nGoodbye!")
            break
        answer = rag_query(user_input)
        print(f"\nAssistant > {answer}\n")
        print("-" * 50)


if __name__ == "__main__":
    main()
```
## Testing the Pipeline

```bash
uv run main.py
```

```text
╔══════════════════════════════════════════════════╗
║               RAG Query Assistant                ║
║          Type 'quit' or 'exit' to stop           ║
╚══════════════════════════════════════════════════╝

You > What is RAG?

Retrieved 3 documents:
  1. [similarity=0.7197] RAG combines retrieval with LLMs...

Assistant > RAG stands for Retrieval-Augmented Generation.
It combines retrieval with LLMs for grounded generation,
meaning it first finds relevant documents, then uses them
as context to generate accurate answers.

--------------------------------------------------

You > exit
```
## Semantic Search vs Keyword Search

Notice how semantic search handles synonyms and related concepts:

| Query | Top Match | Similarity |
|---|---|---|
| "coding language" | "Python is a programming language" | 0.70 |
| "database management" | "PostgreSQL is a relational database" | 0.72 |
| "postgres" (single keyword) | "PostgreSQL…" | 0.56 (lower) |
Single keywords work less well because embedding models are trained on sentences. For better keyword matching, consider hybrid search (combining full-text search with vector search).
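A common way to merge the two result sets in hybrid search is reciprocal rank fusion (RRF): each document scores the sum of 1/(k + rank) across the lists it appears in. A toy sketch (the document IDs and the conventional k = 60 constant are illustrative assumptions, not from this project's code):

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge several ranked lists of doc IDs; each doc scores sum(1 / (k + rank))."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = [3, 1, 7]  # doc IDs from full-text search, best first
vector_hits = [1, 9, 3]   # doc IDs from pgvector search, best first
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # → [1, 3, 9, 7]
```

Documents that rank well in both lists (here, docs 1 and 3) float to the top, without ever comparing the incompatible raw scores of the two search methods.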
## Why Ollama for Embeddings?
| Aspect | Ollama | Cloud APIs |
|---|---|---|
| Privacy | Data stays local | Sent to external servers |
| Cost | Free | Pay per token |
| Latency | Fast (no network) | Network overhead |
| Offline | Works offline | Requires internet |
Ollama keeps the model warm in memory, so subsequent requests are nearly instant.
## Why Not Claude for Embeddings?

Anthropic (Claude) doesn't offer embedding models — they focus on chat/completion. Use one of:

- Ollama: `nomic-embed-text`, `mxbai-embed-large`
- OpenAI: `text-embedding-3-large`
- Cohere: `embed-v4`
- Google: `text-embedding-004`
## Next Steps
- Hybrid search: Combine keyword (tsvector) + semantic (pgvector) search
- Chunking: Split large documents into smaller pieces before embedding
- Reranking: Use a cross-encoder to rerank retrieved documents
- Streaming: Stream LLM responses for better UX
- Caching: Cache embeddings for repeated queries
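As a starting point for the chunking item above, here is a naive fixed-size splitter with character overlap (the sizes are illustrative; production splitters, such as those shipped with LangChain, respect sentence and token boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step back by `overlap` to preserve context
    return chunks

doc = "word " * 300  # ~1500 characters
chunks = chunk_text(doc, chunk_size=500, overlap=50)
print(len(chunks), len(chunks[0]))  # → 4 500
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides; each chunk would then be embedded and inserted exactly like the sample documents above.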
## Conclusion
RAG grounds LLM responses in your data, reducing hallucinations and enabling question-answering over private documents. With pgvector for storage, Ollama for local embeddings, and LangChain for orchestration, you can build production-ready RAG pipelines that run entirely on your infrastructure.
The full code is available in the learn-rag repository.