This post documents the implementation of a Retrieval-Augmented Generation (RAG) system for gold market intelligence, running entirely on a homelab Kubernetes cluster with GPU acceleration.
The Goal
Build a self-hosted AI system that:
- Ingests gold market data from multiple sources (FRED, GoldAPI, RSS feeds)
- Stores embeddings in a vector database
- Provides natural language query capabilities using a local LLM
- Runs on an NVIDIA RTX 5070 Ti GPU
Architecture
```
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│   Data Ingestion    │───▶│  Embedding Service  │───▶│       Qdrant        │
│     (CronJobs)      │    │ (nomic-embed-text)  │    │   (Vector Store)    │
└─────────────────────┘    └─────────────────────┘    └──────────┬──────────┘
                                                                 │
┌─────────────────────┐    ┌─────────────────────┐               │
│    Query Service    │◀───│       Ollama        │◀──────────────┘
│   (RAG API + UI)    │    │   (Llama 3.1 8B)    │
└─────────────────────┘    └─────────────────────┘
           │                          │
           │                   ┌──────┴──────┐
           ▼                   │ RTX 5070 Ti │
    Web UI @ :80               │   (16GB)    │
                               └─────────────┘
```
Components
| Component | Purpose | Image |
|---|---|---|
| Ollama | LLM inference (Llama 3.1 8B) + embeddings (nomic-embed-text) | ollama/ollama |
| Qdrant | Vector database for storing embeddings | qdrant/qdrant |
| Data Ingestion | CronJobs fetching from FRED, GoldAPI, RSS | Custom Python/FastAPI |
| Embedding Service | Converts text to vectors, stores in Qdrant | Custom Python/FastAPI |
| Query Service | RAG pipeline + web UI | Custom Python/FastAPI |
Data Sources
| Source | Data | Schedule |
|---|---|---|
| FRED | Gold price history, CPI, Fed Funds Rate, 10Y Treasury, USD Index | Every 6 hours |
| GoldAPI.io | Real-time XAU/USD spot price | Hourly |
| RSS Feeds | Market news from Investing.com | Every 4 hours |
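Each collector is a thin HTTP client around its source's API. As a rough sketch of what `collectors/fred.py` might look like (the endpoint and parameters follow the public FRED API, but the series list, env var name, and function shape are illustrative, not the actual implementation):

```python
# Illustrative FRED collector sketch — the endpoint and parameters are the
# public FRED API; series IDs, env var name, and helper names are assumptions.
import os

import requests

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"
SERIES_IDS = ["CPIAUCSL", "FEDFUNDS", "DGS10"]  # CPI, Fed Funds Rate, 10Y Treasury

def fetch_series(series_id: str) -> list[dict]:
    """Fetch the most recent observations for one FRED series."""
    resp = requests.get(FRED_URL, params={
        "series_id": series_id,
        "api_key": os.environ["FRED_API_KEY"],
        "file_type": "json",
        "sort_order": "desc",  # newest first
        "limit": 30,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["observations"]  # [{"date": "...", "value": "..."}, ...]
```

Collected records then flow into the embedding service, per the architecture above.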
Implementation
Repository Structure
```
gold-intelligence/
├── .gitlab-ci.yml
├── services/
│   ├── data-ingestion/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── src/
│   │       ├── main.py
│   │       └── collectors/
│   │           ├── fred.py
│   │           ├── gold_api.py
│   │           └── news_rss.py
│   ├── embedding-service/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   └── src/
│   │       ├── main.py
│   │       ├── embedder.py
│   │       └── qdrant_client.py
│   └── query-service/
│       ├── Dockerfile
│       ├── requirements.txt
│       └── src/
│           ├── main.py
│           ├── rag_pipeline.py
│           ├── ollama_client.py
│           └── static/          # Web UI
├── helm/
│   ├── data-ingestion/
│   ├── embedding-service/
│   ├── query-service/
│   ├── ollama-values.yaml
│   └── qdrant-values.yaml
└── kubernetes/
    └── argocd/
```
Ollama Configuration
The key to GPU acceleration is `runtimeClassName: nvidia` in the Helm values:
```yaml
# helm/ollama-values.yaml
replicaCount: 1

image:
  repository: ollama/ollama
  tag: latest

ollama:
  gpu:
    enabled: true
    type: nvidia
    number: 1
  models:
    pull:
      - llama3.1:8b
      - nomic-embed-text

resources:
  requests:
    cpu: 2000m
    memory: 8Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 8000m
    memory: 16Gi
    nvidia.com/gpu: 1

persistence:
  enabled: true
  size: 50Gi

nodeSelector:
  nvidia.com/gpu.present: "true"

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

runtimeClassName: nvidia
```
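Once the pod is up, a quick sanity check that both models were pulled is Ollama's `/api/tags` endpoint, which lists locally available models:

```python
# List the models Ollama has pulled, using its standard REST API.
import requests

tags = requests.get("http://gold-intel-ollama:11434/api/tags", timeout=10).json()
print([m["name"] for m in tags["models"]])  # expect llama3.1:8b and nomic-embed-text
```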
Qdrant Configuration
```yaml
# helm/qdrant-values.yaml
replicaCount: 1

image:
  repository: docker.io/qdrant/qdrant
  tag: v1.13.2

persistence:
  enabled: true
  size: 20Gi

apiKey:
  enabled: true
  existingSecret: gold-intel-qdrant-api-key
  existingSecretKey: api-key
```
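The embedding service talks to this instance through the official `qdrant-client` library. Below is a minimal sketch of the create-collection and upsert path, sized for `nomic-embed-text`'s 768-dimensional vectors (the helper names and the cosine-distance choice are my assumptions, not the actual `embedder.py`):

```python
# Hedged sketch of the embedding service's Qdrant path; helper names and the
# distance metric are assumptions. nomic-embed-text emits 768-dim vectors.
import uuid

import requests
from qdrant_client import QdrantClient, models

EMBED_DIM = 768

def ensure_collection(client: QdrantClient, name: str) -> None:
    """Create the collection on first use, sized for nomic-embed-text."""
    if not client.collection_exists(name):
        client.create_collection(
            collection_name=name,
            vectors_config=models.VectorParams(size=EMBED_DIM, distance=models.Distance.COSINE),
        )

def embed(text: str) -> list[float]:
    """Embed text via Ollama's /api/embeddings endpoint."""
    resp = requests.post(
        "http://gold-intel-ollama:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def store(client: QdrantClient, collection: str, text: str) -> None:
    """Embed a document and upsert it into Qdrant."""
    ensure_collection(client, collection)
    client.upsert(
        collection_name=collection,
        points=[models.PointStruct(id=str(uuid.uuid4()), vector=embed(text), payload={"text": text})],
    )
```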
Service Configuration
Each Python service follows the same pattern, with its own Helm chart. The key environment variables:
```yaml
# helm/embedding-service/values.yaml
env:
  OLLAMA_HOST: "http://gold-intel-ollama:11434"
  QDRANT_HOST: "gold-intel-qdrant"
  QDRANT_PORT: "6333"

envSecrets:
  - name: QDRANT_API_KEY
    secretName: gold-intel-qdrant-api-key
    secretKey: api-key
```
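Inside the services these variables land in a small settings object; later snippets reference `settings.qdrant_host` and friends. A minimal sketch, assuming `pydantic-settings` (the actual config class may differ):

```python
# Minimal settings sketch; pydantic-settings is an assumption — any
# env-var-based config loader would do. Fields map to the env vars above.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    ollama_host: str = "http://gold-intel-ollama:11434"
    qdrant_host: str = "gold-intel-qdrant"
    qdrant_port: int = 6333
    qdrant_api_key: str = ""  # injected via envSecrets from the Kubernetes secret

settings = Settings()  # values are read from the container environment
```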
RAG Pipeline
The query service implements a standard RAG pattern:
1. Embed the user's question using nomic-embed-text
2. Search Qdrant for similar documents across collections
3. Build context from the search results
4. Send context + question to Llama 3.1 8B
5. Return the response with source citations
```python
# Simplified RAG flow
query_embedding = ollama.embed(question)

search_results = qdrant.search_multiple_collections(
    collections=["economic_data", "gold_prices", "market_news"],
    query_embedding=query_embedding,
    top_k=5,
)

context = build_context(search_results)

response = ollama.generate(
    prompt=f"Context:\n{context}\n\nQuestion: {question}",
    system_prompt=SYSTEM_PROMPT,
)
```
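The `search_multiple_collections` wrapper is the one non-obvious piece. A hedged sketch of how it might fan the query out and merge results (the real wrapper in `qdrant_client.py` may differ):

```python
# Hedged sketch of the multi-collection fan-out; the real wrapper may differ.
from qdrant_client import QdrantClient

def search_multiple_collections(client: QdrantClient, collections: list[str],
                                query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Search each collection, then keep the best-scoring hits overall."""
    hits = []
    for name in collections:
        for point in client.search(collection_name=name,
                                   query_vector=query_embedding, limit=top_k):
            hits.append({"collection": name,
                         "text": point.payload.get("text", ""),
                         "score": point.score})
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
```

Scores from different collections are directly comparable here because every collection uses the same embedding model and distance metric.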
Web UI
The query service includes a static HTML/CSS/JS frontend served by FastAPI:
```python
from pathlib import Path

from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# Serve the static frontend alongside the API routes
static_dir = Path(__file__).parent / "static"
app.mount("/static", StaticFiles(directory=str(static_dir)), name="static")

@app.get("/")
async def root():
    return FileResponse(str(static_dir / "index.html"))
```
GitLab CI Pipeline
```yaml
stages:
  - test
  - build

.test-template:
  stage: test
  image: python:3.12-slim
  before_script:
    - pip install -r requirements.txt
  script:
    - python -m py_compile src/*.py
  allow_failure: true

.build-template:
  stage: build
  image: docker:24.0.5
  services:
    - docker:24.0.5-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    - docker build -t $CI_REGISTRY_IMAGE/$SERVICE:$CI_COMMIT_SHORT_SHA .
    - docker push $CI_REGISTRY_IMAGE/$SERVICE:$CI_COMMIT_SHORT_SHA

# Concrete jobs extend the hidden templates, one pair per service, e.g.:
build-embedding-service:
  extends: .build-template
  variables:
    SERVICE: embedding-service
```
Kubernetes Secrets
Three secrets are required:
```bash
# API keys for data sources
kubectl create secret generic gold-intel-api-keys -n gold-intelligence \
  --from-literal=fred-api-key=<FRED_KEY> \
  --from-literal=gold-api-key=<GOLDAPI_KEY>

# Qdrant API key
kubectl create secret generic gold-intel-qdrant-api-key -n gold-intelligence \
  --from-literal=api-key=$(openssl rand -hex 32)

# GitLab registry credentials
kubectl create secret docker-registry gitlab-registry -n gold-intelligence \
  --docker-server=registry.gitlab.com \
  --docker-username=<DEPLOY_TOKEN_USER> \
  --docker-password=<DEPLOY_TOKEN>
```
Exposing the Service
The query service is exposed via a MetalLB LoadBalancer:
```yaml
# helm/query-service/values.yaml
service:
  type: LoadBalancer
  port: 80
  loadBalancerIP: "192.168.2.225"
  annotations:
    metallb.universe.tf/loadBalancerIPs: "192.168.2.225"
```
A DNS entry is added via the OPNsense API:

```bash
./scripts/opnsense-dns.sh add gold-intel 192.168.2.225 "Gold Intelligence RAG Service"
```
Issues Encountered
Qdrant Client Defaults to HTTPS
When an API key is provided, the qdrant-client library defaults to HTTPS. For internal cluster communication over HTTP, this must be explicitly disabled:
```python
from qdrant_client import QdrantClient

client = QdrantClient(
    host=settings.qdrant_host,
    port=settings.qdrant_port,
    api_key=settings.qdrant_api_key,
    https=False,  # required for plain-HTTP communication inside the cluster
)
```
Ollama GPU Detection
Ollama was reporting 0 B of VRAM until `runtimeClassName: nvidia` was added to the Helm values. The NVIDIA GPU Operator must also be properly installed and configured on the cluster.
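A quick way to verify the fix is Ollama's `/api/ps` endpoint, which reports per-model VRAM residency once a model has been loaded by at least one request:

```python
# size_vram > 0 means the model is resident on the GPU rather than in RAM.
import requests

ps = requests.get("http://gold-intel-ollama:11434/api/ps", timeout=10).json()
for m in ps.get("models", []):
    print(m["name"], "VRAM bytes:", m.get("size_vram", 0))
```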
Service Name Resolution
All services are prefixed with the Helm release name (e.g., `gold-intel-ollama`, not `ollama`). Environment variables must use the full service name (or the fully qualified form, `gold-intel-ollama.gold-intelligence.svc.cluster.local`) for DNS resolution to work.
Helm/ArgoCD Image Tag Management
ArgoCD would revert manually set image tags back to whatever was committed in Git. The solution was to pin specific image tags in the Helm values and redeploy through ArgoCD.
Resource Usage
| Component | CPU Request | Memory Request | GPU | Storage |
|---|---|---|---|---|
| Ollama | 2000m | 8Gi | 1 | 50Gi |
| Qdrant | 200m | 512Mi | - | 20Gi |
| Data Ingestion | 100m | 256Mi | - | - |
| Embedding Service | 200m | 512Mi | - | - |
| Query Service | 100m | 256Mi | - | - |
Result
The system is accessible at http://gold-intel.minoko.life with:
- Web UI for submitting natural language queries
- ~4 second response time (after model warmup)
- Automatic data refresh via CronJobs
- Source citations in responses
- API documentation at `/docs`
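For programmatic access, a hedged sketch of a query call (the exact route and payload shape are assumptions; the OpenAPI docs at `/docs` are authoritative):

```python
# Hypothetical client call — the /query route is an assumption; see /docs.
import requests

resp = requests.post(
    "http://gold-intel.minoko.life/query",
    json={"query": "What is the current gold price?"},
    timeout=60,
)
print(resp.json()["response"])
```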
Example query response:
```json
{
  "query": "What is the current gold price?",
  "response": "Based on the GoldAPI data, the current gold spot price as of 2026-01-03 is $4332.32 per troy ounce...",
  "sources": [
    {
      "collection": "gold_prices",
      "text": "Gold spot price on 2026-01-03: $4332.32 per troy ounce...",
      "score": 0.6804
    }
  ],
  "model": "llama3.1:8b",
  "latency_ms": 4275
}
```