This post documents the implementation of a Retrieval-Augmented Generation (RAG) system for gold market intelligence, running entirely on a homelab Kubernetes cluster with GPU acceleration.

The Goal

Build a self-hosted AI system that:

  • Ingests gold market data from multiple sources (FRED, GoldAPI, RSS feeds)
  • Stores embeddings in a vector database
  • Provides natural language query capabilities using a local LLM
  • Runs on an NVIDIA RTX 5070 Ti GPU

Architecture

┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│  Data Ingestion     │───▶│  Embedding Service  │───▶│      Qdrant         │
│  (CronJobs)         │    │  (nomic-embed-text) │    │  (Vector Store)     │
└─────────────────────┘    └─────────────────────┘    └──────────┬──────────┘
                                                                  │
┌─────────────────────┐    ┌─────────────────────┐               │
│   Query Service     │◀───│      Ollama         │◀──────────────┘
│   (RAG API + UI)    │    │  (Llama 3.1 8B)     │
└─────────────────────┘    └─────────────────────┘
        │                           │
        │                    ┌──────┴──────┐
        ▼                    │ RTX 5070 Ti │
   Web UI @ :80              │  (16GB)     │
                             └─────────────┘

Components

Component         | Purpose                                                      | Image
------------------|--------------------------------------------------------------|----------------------
Ollama            | LLM inference (Llama 3.1 8B) + embeddings (nomic-embed-text) | ollama/ollama
Qdrant            | Vector database for storing embeddings                       | qdrant/qdrant
Data Ingestion    | CronJobs fetching from FRED, GoldAPI, RSS                    | Custom Python/FastAPI
Embedding Service | Converts text to vectors, stores in Qdrant                   | Custom Python/FastAPI
Query Service     | RAG pipeline + web UI                                        | Custom Python/FastAPI

Data Sources

Source     | Data                                                             | Schedule
-----------|------------------------------------------------------------------|--------------
FRED       | Gold price history, CPI, Fed Funds Rate, 10Y Treasury, USD Index | Every 6 hours
GoldAPI.io | Real-time XAU/USD spot price                                     | Hourly
RSS Feeds  | Market news from Investing.com                                   | Every 4 hours
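A minimal sketch of what one of these collectors might look like. The endpoint and query parameters follow the public FRED observations API; the series ID and the document format are illustrative assumptions, not the actual collectors/fred.py:

```python
from urllib.parse import urlencode

FRED_BASE = "https://api.stlouisfed.org/fred/series/observations"

def fred_observations_url(series_id: str, api_key: str, limit: int = 30) -> str:
    """Build a FRED observations request URL (JSON output, most recent first)."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "sort_order": "desc",
        "limit": limit,
    }
    return f"{FRED_BASE}?{urlencode(params)}"

def to_document(series_name: str, obs: dict) -> str:
    """Render one observation as a text snippet suitable for embedding."""
    return f"{series_name} on {obs['date']}: {obs['value']}"

# Example (no network call made here):
url = fred_observations_url("CPIAUCSL", "demo-key", limit=5)
doc = to_document("US CPI (CPIAUCSL)", {"date": "2026-01-01", "value": "320.5"})
```

The CronJob then fetches the URL, formats each observation with to_document, and hands the snippets to the embedding service.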

Implementation

Repository Structure

gold-intelligence/
├── .gitlab-ci.yml
├── services/
│   ├── data-ingestion/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── src/
│   │       ├── main.py
│   │       └── collectors/
│   │           ├── fred.py
│   │           ├── gold_api.py
│   │           └── news_rss.py
│   ├── embedding-service/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── src/
│   │       ├── main.py
│   │       ├── embedder.py
│   │       └── qdrant_client.py
│   └── query-service/
│       ├── Dockerfile
│       ├── requirements.txt
│       └── src/
│           ├── main.py
│           ├── rag_pipeline.py
│           ├── ollama_client.py
│           └── static/        # Web UI
├── helm/
│   ├── data-ingestion/
│   ├── embedding-service/
│   ├── query-service/
│   ├── ollama-values.yaml
│   └── qdrant-values.yaml
└── kubernetes/
    └── argocd/

Ollama Configuration

The key to GPU acceleration is setting runtimeClassName: nvidia in the Helm values:

# helm/ollama-values.yaml
replicaCount: 1

image:
  repository: ollama/ollama
  tag: latest

ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    pull:
      - llama3.1:8b
      - nomic-embed-text

resources:
  requests:
    cpu: 2000m
    memory: 8Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 8000m
    memory: 16Gi
    nvidia.com/gpu: 1

persistence:
  enabled: true
  size: 50Gi

nodeSelector:
  nvidia.com/gpu.present: "true"

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

runtimeClassName: nvidia

Qdrant Configuration

# helm/qdrant-values.yaml
replicaCount: 1

image:
  repository: docker.io/qdrant/qdrant
  tag: v1.13.2

persistence:
  enabled: true
  size: 20Gi

apiKey:
  enabled: true
  existingSecret: gold-intel-qdrant-api-key
  existingSecretKey: api-key
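For reference, creating a collection is a single PUT request against Qdrant's REST API. This sketch only builds the request pieces, so it needs no running server; the 768-dimension figure assumes nomic-embed-text's embedding size, and the collection name is illustrative:

```python
import json

def collection_request(name: str, api_key: str, dim: int = 768) -> tuple[str, dict, bytes]:
    """Build (path, headers, body) for Qdrant's create-collection endpoint."""
    path = f"/collections/{name}"
    headers = {
        "Content-Type": "application/json",
        "api-key": api_key,  # the value stored in the gold-intel-qdrant-api-key secret
    }
    body = json.dumps({"vectors": {"size": dim, "distance": "Cosine"}}).encode()
    return path, headers, body

path, headers, body = collection_request("gold_prices", "secret")
```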

Service Configuration

Each Python service follows the same Helm chart pattern. The key environment variables, shown here for the embedding service:

# helm/embedding-service/values.yaml
env:
  OLLAMA_HOST: "http://gold-intel-ollama:11434"
  QDRANT_HOST: "gold-intel-qdrant"
  QDRANT_PORT: "6333"

envSecrets:
  - name: QDRANT_API_KEY
    secretName: gold-intel-qdrant-api-key
    secretKey: api-key
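Inside each service, those variables can be loaded into a small settings object. A stdlib-only sketch (the real services may use pydantic or similar; the field names simply mirror the variables above):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    ollama_host: str
    qdrant_host: str
    qdrant_port: int
    qdrant_api_key: str

def load_settings() -> Settings:
    """Read service configuration from the environment set by the Helm chart."""
    return Settings(
        ollama_host=os.environ.get("OLLAMA_HOST", "http://gold-intel-ollama:11434"),
        qdrant_host=os.environ.get("QDRANT_HOST", "gold-intel-qdrant"),
        qdrant_port=int(os.environ.get("QDRANT_PORT", "6333")),
        qdrant_api_key=os.environ.get("QDRANT_API_KEY", ""),
    )

settings = load_settings()
```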

RAG Pipeline

The query service implements a standard RAG pattern:

  1. Embed the user’s question using nomic-embed-text
  2. Search Qdrant for similar documents across collections
  3. Build context from search results
  4. Send context + question to Llama 3.1 8B
  5. Return response with source citations

# Simplified RAG flow
query_embedding = ollama.embed(question)            # nomic-embed-text
search_results = qdrant.search_multiple_collections(
    collections=["economic_data", "gold_prices", "market_news"],
    query_embedding=query_embedding,
    top_k=5,
)
context = build_context(search_results)
response = ollama.generate(                         # llama3.1:8b
    prompt=f"Context:\n{context}\n\nQuestion: {question}",
    system_prompt=SYSTEM_PROMPT,
)

Web UI

The query service includes a static HTML/CSS/JS frontend served by FastAPI:

from pathlib import Path

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse

app = FastAPI()

static_dir = Path(__file__).parent / "static"
app.mount("/static", StaticFiles(directory=str(static_dir)), name="static")

@app.get("/")
async def root():
    return FileResponse(str(static_dir / "index.html"))

GitLab CI Pipeline

Each service defines test and build jobs that extend a pair of hidden templates, with $SERVICE set per job:

stages:
  - test
  - build

.test-template:
  stage: test
  image: python:3.12-slim
  before_script:
    - pip install -r requirements.txt
  script:
    - python -m py_compile src/*.py
  allow_failure: true

.build-template:
  stage: build
  image: docker:24.0.5
  services:
    - docker:24.0.5-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    - docker build -t $CI_REGISTRY_IMAGE/$SERVICE:$CI_COMMIT_SHORT_SHA .
    - docker push $CI_REGISTRY_IMAGE/$SERVICE:$CI_COMMIT_SHORT_SHA

Kubernetes Secrets

Three secrets are required:

# API keys for data sources
kubectl create secret generic gold-intel-api-keys -n gold-intelligence \
  --from-literal=fred-api-key=<FRED_KEY> \
  --from-literal=gold-api-key=<GOLDAPI_KEY>

# Qdrant API key
kubectl create secret generic gold-intel-qdrant-api-key -n gold-intelligence \
  --from-literal=api-key=$(openssl rand -hex 32)

# GitLab registry credentials
kubectl create secret docker-registry gitlab-registry -n gold-intelligence \
  --docker-server=registry.gitlab.com \
  --docker-username=<DEPLOY_TOKEN_USER> \
  --docker-password=<DEPLOY_TOKEN>

Exposing the Service

The query service is exposed via MetalLB LoadBalancer:

# helm/query-service/values.yaml
service:
  type: LoadBalancer
  port: 80
  loadBalancerIP: "192.168.2.225"
  annotations:
    metallb.universe.tf/loadBalancerIPs: "192.168.2.225"

DNS entry added via OPNsense API:

./scripts/opnsense-dns.sh add gold-intel 192.168.2.225 "Gold Intelligence RAG Service"

Issues Encountered

Qdrant Client Defaults to HTTPS

When an API key is provided, the qdrant-client library defaults to HTTPS. For internal cluster communication over HTTP, this must be explicitly disabled:

client = QdrantClient(
    host=settings.qdrant_host,
    port=settings.qdrant_port,
    api_key=settings.qdrant_api_key,
    https=False,  # Required for internal cluster communication
)

Ollama GPU Detection

Ollama was showing 0B VRAM until runtimeClassName: nvidia was added to the Helm values. The NVIDIA GPU operator must be properly configured on the cluster.

Service Name Resolution

All services are prefixed with the Helm release name (e.g., gold-intel-ollama, not ollama). Environment variables must use these full service names (or the fully qualified form, such as gold-intel-qdrant.gold-intelligence.svc.cluster.local) for DNS resolution to work.

Helm/ArgoCD Image Tag Management

Because ArgoCD continuously syncs the cluster back to Git, manually-set image tags were reverted on the next sync. The solution was to pin specific image tags in the Helm values in Git and redeploy through ArgoCD.

Resource Usage

Component         | CPU Request | Memory Request | GPU | Storage
------------------|-------------|----------------|-----|--------
Ollama            | 2000m       | 8Gi            | 1   | 50Gi
Qdrant            | 200m        | 512Mi          | -   | 20Gi
Data Ingestion    | 100m        | 256Mi          | -   | -
Embedding Service | 200m        | 512Mi          | -   | -
Query Service     | 100m        | 256Mi          | -   | -

Result

The system is accessible at http://gold-intel.minoko.life with:

  • Web UI for submitting natural language queries
  • ~4 second response time (after model warmup)
  • Automatic data refresh via CronJobs
  • Source citations in responses
  • API documentation at /docs

Example query response:

{
  "query": "What is the current gold price?",
  "response": "Based on the GoldAPI data, the current gold spot price as of 2026-01-03 is $4332.32 per troy ounce...",
  "sources": [
    {
      "collection": "gold_prices",
      "text": "Gold spot price on 2026-01-03: $4332.32 per troy ounce...",
      "score": 0.6804
    }
  ],
  "model": "llama3.1:8b",
  "latency_ms": 4275
}