
Build a RAG System with LangChain, pgvector, and Ollama

Hoang Dang Tan Phat (Kane)

Feb 16, 2026

RAG (Retrieval-Augmented Generation) combines the power of semantic search with LLMs to answer questions grounded in your own data. Instead of relying solely on the model’s training data, RAG retrieves relevant documents first, then uses them as context for generation. This post walks through building a complete RAG pipeline using pgvector for vector storage, Ollama for local embeddings, and Claude for answer generation.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                      RAG Pipeline                           │
├─────────────────────────────────────────────────────────────┤
│  Query: "What is RAG?"                                      │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────┐                                    │
│  │ Ollama              │  ← Embedding (text → vector)       │
│  │ nomic-embed-text    │                                    │
│  └─────────────────────┘                                    │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────┐                                    │
│  │ pgvector Search     │  ← Find similar documents          │
│  │ (cosine distance)   │                                    │
│  └─────────────────────┘                                    │
│         │                                                   │
│         ▼                                                   │
│  ┌─────────────────────┐                                    │
│  │ Claude Haiku        │  ← Generate answer with context    │
│  │ (LangChain)         │                                    │
│  └─────────────────────┘                                    │
│         │                                                   │
│         ▼                                                   │
│  Answer                                                     │
└─────────────────────────────────────────────────────────────┘

Project Setup

Initialize the project with uv:

uv init learn-rag
cd learn-rag
uv add langchain langchain-anthropic langchain-core pgvector psycopg2-binary python-dotenv requests tenacity

Create a .env file:

DATABASE_URL=postgresql://postgres:postgres@localhost:5432/learn_rag
ANTHROPIC_API_KEY=sk-ant-...

Setting Up PostgreSQL with pgvector

We use Docker to run PostgreSQL with the pgvector extension:

# docker-compose.yml
services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: learn_rag
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

Start the database:

docker-compose up -d

Create the documents table with vector support:

# connection.py
import psycopg2
import os
from dotenv import load_dotenv

load_dotenv()

conn = None
cur = None

try:
    conn = psycopg2.connect(os.getenv("DATABASE_URL"))
    cur = conn.cursor()

    # Enable pgvector extension
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")

    # Create documents table with 768-dimensional vectors
    cur.execute("DROP TABLE IF EXISTS documents CASCADE;")
    cur.execute("""
    CREATE TABLE documents (
        id SERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        metadata JSONB,
        embedding VECTOR(768),
        created_at TIMESTAMPTZ DEFAULT NOW()
    );
    """)

    conn.commit()
    print("Table 'documents' created successfully.")

except Exception as error:
    print("Error:", error)
finally:
    # Guard against a failed connection, where cur/conn were never assigned
    if cur is not None:
        cur.close()
    if conn is not None:
        conn.close()

The VECTOR(768) type stores 768-dimensional vectors — the output size of nomic-embed-text.

Installing Ollama and Pulling the Embedding Model

Ollama runs LLMs and embedding models locally. Install it:

brew install ollama
brew services start ollama

Pull the embedding model:

ollama pull nomic-embed-text

nomic-embed-text is a 137M parameter model that produces 768-dimensional embeddings. It’s fast, runs on CPU, and delivers solid retrieval quality.

Understanding Embeddings

Embeddings convert text into dense vectors where semantically similar texts have similar vectors:

"Python is a programming language"  →  [0.023, -0.156, 0.089, ...]
"coding language"                   →  [0.019, -0.148, 0.092, ...]
"Docker containers"                 →  [-0.234, 0.067, -0.123, ...]

The first two vectors are close (similar meaning), while the third is far away (different topic). This enables semantic search — finding documents by meaning, not just keywords.
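
To make this concrete, here is a minimal sketch that embeds the three example texts with Ollama and compares them using cosine similarity. It uses the same endpoint and model as the search code later in this post; the exact scores you get will vary.

# similarity_demo.py (sketch)
import math
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"
EMBEDDING_MODEL = "nomic-embed-text"


def embed(text: str) -> list[float]:
    """Fetch a 768-dimensional embedding from the local Ollama server."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": EMBEDDING_MODEL, "prompt": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


python_doc = embed("Python is a programming language")
docker_doc = embed("Docker containers")
query = embed("coding language")

print(cosine_similarity(query, python_doc))  # relatively high: similar meaning
print(cosine_similarity(query, docker_doc))  # noticeably lower: different topic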

Why Different Models Have Different Dimensions

Model               Dimensions   Quality   Speed
all-MiniLM-L6-v2    384          Good      Very fast
nomic-embed-text    768          Better    Fast
mxbai-embed-large   1024         Best      Medium

Higher dimensions don’t always mean better quality. Model architecture and training data matter more. Choose based on your latency and quality requirements.

Generating and Storing Embeddings

# search.py
import os
import psycopg2
import requests
from dotenv import load_dotenv
from pgvector.psycopg2 import register_vector
from tenacity import retry, stop_after_attempt, wait_exponential

load_dotenv()

OLLAMA_URL = "http://localhost:11434/api/embeddings"
EMBEDDING_MODEL = "nomic-embed-text"
TOP_K = 3


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def get_embedding(text: str) -> list[float]:
    """Get embedding from Ollama API with retry logic."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": EMBEDDING_MODEL, "prompt": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["embedding"]


def search_similar_documents(query: str, top_k: int = TOP_K) -> list[dict]:
    """Search for similar documents using vector similarity."""
    query_vector = get_embedding(query)

    conn = psycopg2.connect(os.getenv("DATABASE_URL"))
    register_vector(conn)
    cur = conn.cursor()

    try:
        cur.execute(
            """
            SELECT id, content, metadata, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s;
            """,
            (query_vector, query_vector, top_k),
        )

        rows = cur.fetchall()
        return [
            {
                "id": row[0],
                "content": row[1],
                "metadata": row[2],
                "similarity": row[3],
            }
            for row in rows
        ]
    finally:
        cur.close()
        conn.close()

Key points:

  • register_vector(conn): Registers the vector type with psycopg2
  • <=> operator: Computes cosine distance (smaller = more similar)
  • 1 - distance: Converts to similarity score (higher = more similar)
  • @retry: Handles transient network failures with exponential backoff
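
For reference, a quick usage sketch (it assumes documents have already been inserted; the similarity values depend entirely on your data):

from search import search_similar_documents

results = search_similar_documents("What is RAG?", top_k=3)
for doc in results:
    # Each result carries the row id, content, metadata, and a similarity score
    print(f"{doc['similarity']:.4f}  {doc['content'][:60]}")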

Inserting Documents

# insert_embeddings.py
import os
import json
import psycopg2
from dotenv import load_dotenv
from pgvector.psycopg2 import register_vector
from search import get_embedding  # reuse the retrying Ollama embedding helper

load_dotenv()

SAMPLE_DOCUMENTS = [
    {
        "content": "Python is a high-level programming language known for its simplicity.",
        "metadata": {"source": "wiki", "topic": "programming"},
    },
    {
        "content": "PostgreSQL is a powerful open-source relational database.",
        "metadata": {"source": "wiki", "topic": "database"},
    },
    {
        "content": "RAG combines retrieval with LLMs for grounded generation.",
        "metadata": {"source": "paper", "topic": "ai"},
    },
]


def main():
    # Generate embeddings for every sample document
    vectors = [get_embedding(doc["content"]) for doc in SAMPLE_DOCUMENTS]

    conn = psycopg2.connect(os.getenv("DATABASE_URL"))
    register_vector(conn)
    cur = conn.cursor()

    for i, doc in enumerate(SAMPLE_DOCUMENTS):
        cur.execute(
            "INSERT INTO documents (content, metadata, embedding) VALUES (%s, %s, %s)",
            (doc["content"], json.dumps(doc["metadata"]), vectors[i]),
        )

    conn.commit()
    cur.close()
    conn.close()
    print(f"Inserted {len(SAMPLE_DOCUMENTS)} documents.")


if __name__ == "__main__":
    main()
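
If you want to sanity-check the insert, a small count query does the job (a sketch, reusing the same connection settings):

# verify_insert.py (sketch)
import os
import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(os.getenv("DATABASE_URL"))
cur = conn.cursor()
cur.execute("SELECT count(*) FROM documents;")
print("Documents in table:", cur.fetchone()[0])  # expect 3 after the script above
cur.close()
conn.close()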

Building the RAG Pipeline

Now we connect everything — retrieval + generation:

# main.py
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from search import search_similar_documents

load_dotenv()

LLM_MODEL = "claude-haiku-4-5-20251001"
TOP_K = 3


def rag_query(question: str, top_k: int = TOP_K) -> str:
    """Complete RAG pipeline."""
    # Step 1: Retrieve relevant documents
    documents = search_similar_documents(question, top_k)

    # Step 2: Build context from documents
    context = "\n\n".join([
        f"Document {i+1} (relevance: {doc['similarity']:.2f}):\n{doc['content']}"
        for i, doc in enumerate(documents)
    ])

    # Step 3: Generate answer using LLM
    llm = ChatAnthropic(model_name=LLM_MODEL, timeout=30)

    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful assistant that answers questions based on the provided context.
Use ONLY the information from the context to answer. If the context doesn't contain enough information, say so.
Be concise and direct in your answers."""),
        ("human", """Context:
{context}

Question: {question}

Answer:"""),
    ])

    chain = prompt | llm
    response = chain.invoke({"context": context, "question": question})

    return response.content

The system prompt constrains the LLM to only use retrieved context, reducing hallucinations.

Interactive CLI

Wrap it in an interactive loop:

def main():
    print("\n╔══════════════════════════════════════════════════╗")
    print("║              RAG Query Assistant                 ║")
    print("║  Type 'quit' or 'exit' to stop                   ║")
    print("╚══════════════════════════════════════════════════╝\n")

    while True:
        try:
            user_input = input("You > ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\n\nGoodbye!")
            break

        if not user_input:
            continue

        if user_input.lower() in ("quit", "exit"):
            print("\nGoodbye!")
            break

        answer = rag_query(user_input)
        print(f"\nAssistant > {answer}\n")
        print("-" * 50)


if __name__ == "__main__":
    main()

Testing the Pipeline

uv run main.py
╔══════════════════════════════════════════════════╗
║              RAG Query Assistant                 ║
║  Type 'quit' or 'exit' to stop                   ║
╚══════════════════════════════════════════════════╝

You > What is RAG?

  Retrieved 3 documents:
    1. [similarity=0.7197] RAG combines retrieval with LLMs...

Assistant > RAG stands for Retrieval-Augmented Generation.
It combines retrieval with LLMs for grounded generation,
meaning it first finds relevant documents, then uses them
as context to generate accurate answers.

--------------------------------------------------
You > exit

Notice how semantic search handles synonyms and related concepts:

Query                         Top Match                                Similarity
"coding language"             "Python is a programming language"      0.70
"database management"         "PostgreSQL is a relational database"   0.72
"postgres" (single keyword)   "PostgreSQL…"                           0.56 (lower)

Single keywords work less well because embedding models are trained on sentences. For better keyword matching, consider hybrid search (combining full-text search with vector search).
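
As a rough sketch of what hybrid search could look like on this schema (the 0.7/0.3 weights are arbitrary, and a production setup would normally maintain an indexed tsvector column instead of calling to_tsvector per row):

# hybrid_search.py (sketch)
import os
import psycopg2
from dotenv import load_dotenv
from pgvector.psycopg2 import register_vector
from search import get_embedding

load_dotenv()


def hybrid_search(query: str, top_k: int = 3) -> list[tuple]:
    """Blend full-text rank with vector similarity into a single score."""
    query_vector = get_embedding(query)

    conn = psycopg2.connect(os.getenv("DATABASE_URL"))
    register_vector(conn)
    cur = conn.cursor()
    try:
        cur.execute(
            """
            SELECT id, content,
                   0.7 * (1 - (embedding <=> %s::vector))
                 + 0.3 * ts_rank(to_tsvector('english', content),
                                 plainto_tsquery('english', %s)) AS score
            FROM documents
            ORDER BY score DESC
            LIMIT %s;
            """,
            (query_vector, query, top_k),
        )
        return cur.fetchall()
    finally:
        cur.close()
        conn.close()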

Why Ollama for Embeddings?

Aspect    Ollama               Cloud APIs
Privacy   Data stays local     Sent to external servers
Cost      Free                 Pay per token
Latency   Fast (no network)    Network overhead
Offline   Works offline        Requires internet

Ollama keeps the model warm in memory, so subsequent requests are nearly instant.

Why Not Claude for Embeddings?

Anthropic (Claude) doesn’t offer embedding models — they focus on chat/completion. Use:

  • Ollama: nomic-embed-text, mxbai-embed-large
  • OpenAI: text-embedding-3-large
  • Cohere: embed-v4
  • Google: text-embedding-004

Next Steps

  • Hybrid search: Combine keyword (tsvector) + semantic (pgvector) search
  • Chunking: Split large documents into smaller pieces before embedding (a short sketch follows this list)
  • Reranking: Use a cross-encoder to rerank retrieved documents
  • Streaming: Stream LLM responses for better UX
  • Caching: Cache embeddings for repeated queries
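
For chunking in particular, a minimal sketch using LangChain's recursive splitter (the langchain-text-splitters package ships with current langchain releases; the chunk_size and chunk_overlap values are just starting points):

# chunking sketch: split long text before embedding and inserting it
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

long_document = "..."  # any large text you want to index
chunks = splitter.split_text(long_document)

# Embed each chunk and insert it as its own row in `documents`,
# storing a reference to the parent document in the metadata column.
for i, chunk in enumerate(chunks):
    print(i, len(chunk))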

Conclusion

RAG grounds LLM responses in your data, reducing hallucinations and enabling question-answering over private documents. With pgvector for storage, Ollama for local embeddings, and LangChain for orchestration, you can build production-ready RAG pipelines that run entirely on your infrastructure.

The full code is available in the learn-rag repository.

rag langchain pgvector ollama semantic-search embeddings python claude
Hoang Dang Tan Phat (Kane)

Full-stack developer with 8+ years experience. Building scalable systems with Go, TypeScript, and React.