This post documents the implementation of a Retrieval-Augmented Generation (RAG) system for gold market intelligence, running entirely on a homelab Kubernetes cluster with GPU acceleration.
The Goal
Build a self-hosted AI system that:
- Ingests gold market data from multiple sources (FRED, GoldAPI, RSS feeds)
- Stores embeddings in a vector database
- Provides natural language query capabilities using a local LLM
- Runs on an NVIDIA RTX 5070 Ti GPU
Architecture
```
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│   Data Ingestion    │───▶│  Embedding Service  │───▶│       Qdrant        │
│     (CronJobs)      │    │ (nomic-embed-text)  │    │   (Vector Store)    │
└─────────────────────┘    └─────────────────────┘    └──────────┬──────────┘
                                                                 │
┌─────────────────────┐    ┌─────────────────────┐               │
│    Query Service    │◀───│       Ollama        │◀──────────────┘
│   (RAG API + UI)    │    │   (Llama 3.1 8B)    │
└─────────────────────┘    └─────────────────────┘
           │                          │
           │                   ┌──────┴──────┐
           ▼                   │ RTX 5070 Ti │
    Web UI @ :80               │   (16GB)    │
                               └─────────────┘
```
Components
| Component | Purpose | Image |
|---|---|---|
| Ollama | LLM inference (Llama 3.1 8B) + embeddings (nomic-embed-text) | ollama/ollama |
| Qdrant | Vector database for storing embeddings | qdrant/qdrant |
| Data Ingestion | CronJobs fetching from FRED, GoldAPI, RSS | Custom Python/FastAPI |
| Embedding Service | Converts text to vectors, stores in Qdrant | Custom Python/FastAPI |
| Query Service | RAG pipeline + web UI | Custom Python/FastAPI |
Data Sources
| Source | Data | Schedule |
|---|---|---|
| FRED | Gold price history, CPI, Fed Funds Rate, 10Y Treasury, USD Index | Every 6 hours |
| GoldAPI.io | Real-time XAU/USD spot price | Hourly |
| RSS Feeds | Market news from Investing.com | Every 4 hours |
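Each collector is a thin HTTP client around its source's API. As a rough sketch of what `collectors/fred.py` might look like (the endpoint and parameters follow the public FRED API, but the series list, env var name, and function shape are illustrative, not the actual implementation):

```python
# Illustrative FRED collector sketch — the endpoint and parameters are the
# public FRED API; series IDs, env var name, and helper names are assumptions.
import os

import requests

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"
SERIES_IDS = ["CPIAUCSL", "FEDFUNDS", "DGS10"]  # CPI, Fed Funds Rate, 10Y Treasury

def fetch_series(series_id: str) -> list[dict]:
    """Fetch the most recent observations for one FRED series."""
    resp = requests.get(FRED_URL, params={
        "series_id": series_id,
        "api_key": os.environ["FRED_API_KEY"],
        "file_type": "json",
        "sort_order": "desc",  # newest first
        "limit": 30,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["observations"]  # [{"date": "...", "value": "..."}, ...]
```

Collected records then flow into the embedding service, per the architecture above.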
Implementation
Repository Structure
```
gold-intelligence/
├── .gitlab-ci.yml
├── services/
│   ├── data-ingestion/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── src/
│   │       ├── main.py
│   │       └── collectors/
│   │           ├── fred.py
│   │           ├── gold_api.py
│   │           └── news_rss.py
│   ├── embedding-service/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── src/
│   │       ├── main.py
│   │       ├── embedder.py
│   │       └── qdrant_client.py
│   └── query-service/
│       ├── Dockerfile
│       ├── requirements.txt
│       └── src/
│           ├── main.py
│           ├── rag_pipeline.py
│           ├── ollama_client.py
│           └── static/          # Web UI
├── helm/
│   ├── data-ingestion/
│   ├── embedding-service/
│   ├── query-service/
│   ├── ollama-values.yaml
│   └── qdrant-values.yaml
└── kubernetes/
    └── argocd/
```
Ollama Configuration
The key to GPU acceleration is `runtimeClassName: nvidia` in the Helm values:
```yaml
# helm/ollama-values.yaml
replicaCount: 1

image:
  repository: ollama/ollama
  tag: latest

ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    pull:
      - llama3.1:8b
      - nomic-embed-text

resources:
  requests:
    cpu: 2000m
    memory: 8Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 8000m
    memory: 16Gi
    nvidia.com/gpu: 1

persistence:
  enabled: true
  size: 50Gi

nodeSelector:
  nvidia.com/gpu.present: "true"

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

runtimeClassName: nvidia
```
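Once the pod is up, a quick sanity check that both models were pulled is Ollama's `/api/tags` endpoint, which lists locally available models:

```python
# List the models Ollama has pulled, using its standard REST API.
import requests

tags = requests.get("http://gold-intel-ollama:11434/api/tags", timeout=10).json()
print([m["name"] for m in tags["models"]])  # expect llama3.1:8b and nomic-embed-text
```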
Qdrant Configuration
```yaml
# helm/qdrant-values.yaml
replicaCount: 1

image:
  repository: docker.io/qdrant/qdrant
  tag: v1.13.2

persistence:
  enabled: true
  size: 20Gi

apiKey:
  enabled: true
  existingSecret: gold-intel-qdrant-api-key
  existingSecretKey: api-key
```
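The embedding service talks to this instance through the official `qdrant-client` library. Below is a minimal sketch of the create-collection and upsert path, sized for `nomic-embed-text`'s 768-dimensional vectors (the helper names and the cosine-distance choice are my assumptions, not the actual `embedder.py`):

```python
# Hedged sketch of the embedding service's Qdrant path; helper names and the
# distance metric are assumptions. nomic-embed-text emits 768-dim vectors.
import uuid

import requests
from qdrant_client import QdrantClient, models

EMBED_DIM = 768

def ensure_collection(client: QdrantClient, name: str) -> None:
    """Create the collection on first use, sized for nomic-embed-text."""
    if not client.collection_exists(name):
        client.create_collection(
            collection_name=name,
            vectors_config=models.VectorParams(size=EMBED_DIM, distance=models.Distance.COSINE),
        )

def embed(text: str) -> list[float]:
    """Embed text via Ollama's /api/embeddings endpoint."""
    resp = requests.post(
        "http://gold-intel-ollama:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def store(client: QdrantClient, collection: str, text: str) -> None:
    """Embed a document and upsert it into Qdrant."""
    ensure_collection(client, collection)
    client.upsert(
        collection_name=collection,
        points=[models.PointStruct(id=str(uuid.uuid4()), vector=embed(text), payload={"text": text})],
    )
```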
Service Configuration
Each Python service follows the same pattern, with its own Helm chart. The key environment variables:
```yaml
# helm/embedding-service/values.yaml
env:
  OLLAMA_HOST: "http://gold-intel-ollama:11434"
  QDRANT_HOST: "gold-intel-qdrant"
  QDRANT_PORT: "6333"

envSecrets:
  - name: QDRANT_API_KEY
    secretName: gold-intel-qdrant-api-key
    secretKey: api-key
```
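Inside the services these variables land in a small settings object; later snippets reference `settings.qdrant_host` and friends. A minimal sketch, assuming `pydantic-settings` (the actual config class may differ):

```python
# Minimal settings sketch; pydantic-settings is an assumption — any
# env-var-based config loader would do. Fields map to the env vars above.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    ollama_host: str = "http://gold-intel-ollama:11434"
    qdrant_host: str = "gold-intel-qdrant"
    qdrant_port: int = 6333
    qdrant_api_key: str = ""  # injected via envSecrets from the Kubernetes secret

settings = Settings()  # values are read from the container environment
```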
RAG Pipeline
The query service implements a standard RAG pattern:
1. Embed the user's question using nomic-embed-text
2. Search Qdrant for similar documents across collections
3. Build context from the search results
4. Send context + question to Llama 3.1 8B
5. Return the response with source citations
```python
# Simplified RAG flow
query_embedding = ollama.embed(question)

search_results = qdrant.search_multiple_collections(
    collections=["economic_data", "gold_prices", "market_news"],
    query_embedding=query_embedding,
    top_k=5,
)

context = build_context(search_results)

response = ollama.generate(
    prompt=f"Context:\n{context}\n\nQuestion: {question}",
    system_prompt=SYSTEM_PROMPT,
)
```
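The `search_multiple_collections` wrapper is the one non-obvious piece. A hedged sketch of how it might fan the query out and merge results (the real wrapper in `qdrant_client.py` may differ):

```python
# Hedged sketch of the multi-collection fan-out; the real wrapper may differ.
from qdrant_client import QdrantClient

def search_multiple_collections(client: QdrantClient, collections: list[str],
                                query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Search each collection, then keep the best-scoring hits overall."""
    hits = []
    for name in collections:
        for point in client.search(collection_name=name,
                                   query_vector=query_embedding, limit=top_k):
            hits.append({"collection": name,
                         "text": point.payload.get("text", ""),
                         "score": point.score})
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
```

Scores from different collections are directly comparable here because every collection uses the same embedding model and distance metric.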
Web UI
The query service includes a static HTML/CSS/JS frontend served by FastAPI:
```python
from pathlib import Path

from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# Serve the static frontend alongside the API routes
static_dir = Path(__file__).parent / "static"
app.mount("/static", StaticFiles(directory=str(static_dir)), name="static")

@app.get("/")
async def root():
    return FileResponse(str(static_dir / "index.html"))
```
GitLab CI Pipeline
```yaml
stages:
  - test
  - build

.test-template:
  stage: test
  image: python:3.12-slim
  before_script:
    - pip install -r requirements.txt
  script:
    - python -m py_compile src/*.py
  allow_failure: true

.build-template:
  stage: build
  image: docker:24.0.5
  services:
    - docker:24.0.5-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    - docker build -t $CI_REGISTRY_IMAGE/$SERVICE:$CI_COMMIT_SHORT_SHA .
    - docker push $CI_REGISTRY_IMAGE/$SERVICE:$CI_COMMIT_SHORT_SHA

# Concrete jobs extend the hidden templates, one pair per service, e.g.:
build-embedding-service:
  extends: .build-template
  variables:
    SERVICE: embedding-service
```
Kubernetes Secrets
Three secrets are required:
```bash
# API keys for data sources
kubectl create secret generic gold-intel-api-keys -n gold-intelligence \
  --from-literal=fred-api-key=<FRED_KEY> \
  --from-literal=gold-api-key=<GOLDAPI_KEY>

# Qdrant API key
kubectl create secret generic gold-intel-qdrant-api-key -n gold-intelligence \
  --from-literal=api-key=$(openssl rand -hex 32)

# GitLab registry credentials
kubectl create secret docker-registry gitlab-registry -n gold-intelligence \
  --docker-server=registry.gitlab.com \
  --docker-username=<DEPLOY_TOKEN_USER> \
  --docker-password=<DEPLOY_TOKEN>
```
Exposing the Service
The query service is exposed via a MetalLB LoadBalancer:
```yaml
# helm/query-service/values.yaml
service:
  type: LoadBalancer
  port: 80
  loadBalancerIP: "192.168.2.225"
  annotations:
    metallb.universe.tf/loadBalancerIPs: "192.168.2.225"
```
A DNS entry is added via the OPNsense API:

```bash
./scripts/opnsense-dns.sh add gold-intel 192.168.2.225 "Gold Intelligence RAG Service"
```
Issues Encountered
Qdrant Client Defaults to HTTPS
When an API key is provided, the qdrant-client library defaults to HTTPS. For internal cluster communication over HTTP, this must be explicitly disabled:
```python
from qdrant_client import QdrantClient

client = QdrantClient(
    host=settings.qdrant_host,
    port=settings.qdrant_port,
    api_key=settings.qdrant_api_key,
    https=False,  # required for plain-HTTP communication inside the cluster
)
```
Ollama GPU Detection
Ollama was reporting 0 B of VRAM until `runtimeClassName: nvidia` was added to the Helm values. The NVIDIA GPU Operator must also be properly installed and configured on the cluster.
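A quick way to verify the fix is Ollama's `/api/ps` endpoint, which reports per-model VRAM residency once a model has been loaded by at least one request:

```python
# size_vram > 0 means the model is resident on the GPU rather than in RAM.
import requests

ps = requests.get("http://gold-intel-ollama:11434/api/ps", timeout=10).json()
for m in ps.get("models", []):
    print(m["name"], "VRAM bytes:", m.get("size_vram", 0))
```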
Service Name Resolution
All services are prefixed with the Helm release name (e.g., `gold-intel-ollama`, not `ollama`). Environment variables must use the full service name (or the fully qualified form, `gold-intel-ollama.gold-intelligence.svc.cluster.local`) for DNS resolution to work.
Helm/ArgoCD Image Tag Management
ArgoCD would revert manually set image tags back to whatever was committed in Git. The solution was to pin specific image tags in the Helm values and redeploy through ArgoCD.
Resource Usage
| Component | CPU Request | Memory Request | GPU | Storage |
|---|---|---|---|---|
| Ollama | 2000m | 8Gi | 1 | 50Gi |
| Qdrant | 200m | 512Mi | - | 20Gi |
| Data Ingestion | 100m | 256Mi | - | - |
| Embedding Service | 200m | 512Mi | - | - |
| Query Service | 100m | 256Mi | - | - |
Result
The system is accessible at http://gold-intel.minoko.life with:
- Web UI for submitting natural language queries
- ~4 second response time (after model warmup)
- Automatic data refresh via CronJobs
- Source citations in responses
- API documentation at `/docs`
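For programmatic access, a hedged sketch of a query call (the exact route and payload shape are assumptions; the OpenAPI docs at `/docs` are authoritative):

```python
# Hypothetical client call — the /query route is an assumption; see /docs.
import requests

resp = requests.post(
    "http://gold-intel.minoko.life/query",
    json={"query": "What is the current gold price?"},
    timeout=60,
)
print(resp.json()["response"])
```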
Example query response:
```json
{
  "query": "What is the current gold price?",
  "response": "Based on the GoldAPI data, the current gold spot price as of 2026-01-03 is $4332.32 per troy ounce...",
  "sources": [
    {
      "collection": "gold_prices",
      "text": "Gold spot price on 2026-01-03: $4332.32 per troy ounce...",
      "score": 0.6804
    }
  ],
  "model": "llama3.1:8b",
  "latency_ms": 4275
}
```