
Building a RAG-Powered API for Kubernetes Troubleshooting with K8sGPT

In this post, we’ll explore how to serve the RAG system as a REST API and integrate it with K8sGPT. This setup enables Kubernetes users to access enhanced troubleshooting insights based on structured, official documentation.

RAG-Powered API

Running the RAG API Server

Our RAG API leverages the vector database we built earlier and processes user queries using ChromaDB, LangChain, and OpenAI. It exposes a simple REST API endpoint that takes a query and returns a response enriched with Kubernetes documentation.
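The core request flow can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual server code: keyword overlap substitutes for ChromaDB's vector similarity search, and the prompt is returned directly instead of being sent to OpenAI through LangChain, so the whole retrieve-then-ground shape stays visible.

```python
import re

def _tokens(text: str) -> set[str]:
    # Lowercased word tokens, punctuation stripped.
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def retrieve_context(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documentation chunks by keyword overlap with the query.

    Stand-in for a ChromaDB similarity search over documentation embeddings.
    """
    terms = _tokens(query)
    scored = sorted(corpus, key=lambda c: len(terms & _tokens(c)), reverse=True)
    return scored[:k]

def handle_completion(query: str, corpus: list[str]) -> dict:
    """Shape of the /completion handler: retrieve context, build a grounded prompt."""
    chunks = retrieve_context(query, corpus)
    prompt = "Context:\n" + "\n---\n".join(chunks) + f"\n\nQuestion: {query}"
    # In the real server, `prompt` goes to OpenAI via LangChain; returning it
    # here keeps the request/response contract visible without an API key.
    return {"rag_response": prompt}
```

The key design point is that the LLM only ever sees the query alongside retrieved documentation, which is what grounds its answers.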

Quick Start

To run the server:


docker run -e OPENAI_API_KEY="<your-openai-key>" \
  --volume /tmp/testcontainer:/tmp/testcontainer \
  -p 8000:8000 \
  --name rag_server_api rag_server_demo

Test it using curl:

curl -X 'POST' \
  'http://0.0.0.0:8000/completion' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": "What is Kubernetes?"
}' | jq

Sample response:

{
  "rag_response": "Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services..."
}

Expanding Knowledge Sources

Our RAG system is flexible: it can easily incorporate additional knowledge sources, such as:

  • Customer-specific documentation
  • Code repositories
  • Logs and monitoring data
  • Kubernetes audit and event logs

This extensibility makes RAG an ideal solution for advanced Kubernetes troubleshooting beyond generic LLM responses.
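Adding a new source mostly comes down to chunking it for embedding. Below is a minimal sketch of such a chunker; the function name and parameter defaults are illustrative, and the embedded chunks would then be added to the ChromaDB collection (e.g. via `collection.add`) just like the Kubernetes documentation was.

```python
def chunk_document(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split a new knowledge source (runbook, log excerpt, README) into
    overlapping word-based chunks sized for embedding.

    Overlap preserves context across chunk boundaries so a sentence split
    in two is still retrievable from either side.
    """
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks
```

In practice you would tune chunk size and overlap per source type: logs tolerate small chunks, while prose documentation usually benefits from larger ones.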

Enhancing K8sGPT with RAG

K8sGPT is an AI-powered Kubernetes troubleshooting tool that analyzes clusters and provides insights using Go-based analyzers. These analyzers detect common Kubernetes failure patterns and deliver structured diagnostic data.

By integrating our RAG API, K8sGPT can:

  • Retrieve structured knowledge from Kubernetes documentation.
  • Provide richer context when diagnosing issues.
  • Minimize hallucinations by grounding responses in official documentation.

Setting Up K8sGPT with RAG

First, create a local Kubernetes cluster using Kind:

kind create cluster

Next, deploy a broken pod to test troubleshooting capabilities:

apiVersion: v1
kind: Pod
metadata:
  name: broken-pod
  namespace: default
spec:
  containers:
    - name: broken-pod
      image: nginx:1.a.b.c # Invalid image tag
      livenessProbe:
        httpGet:
          path: /
          port: 81
        initialDelaySeconds: 3
        periodSeconds: 3

Apply the configuration:

kubectl apply -f broken-pod.yml

Clone and build the K8sGPT project with RAG integration:

git clone https://github.com/elieser1101/k8sgpt.git
cd k8sgpt
git checkout CustomRagAIClient
make build

The CustomRagAIClient branch of this fork contains a custom AI provider that allows K8sGPT to communicate with the RAG API, which is why you need to build the application from source.

Please feel free to explore the code to understand what happens under the hood.
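The custom provider in the fork is written in Go, but the HTTP call it makes is the same one shown in the curl example above. As an assumed illustration of that contract, here is the equivalent request built (not sent) with Python's standard library:

```python
import json
import urllib.request

def rag_completion_request(query: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request a custom backend would issue against /completion.

    The endpoint path and JSON body shape mirror the curl example; the
    base_url default is an assumption for local testing.
    """
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen` against the running container returns the same `rag_response` JSON the curl command does.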

Using K8sGPT’s Analyze and Explain Features

NOTE: Ensure you use the freshly built k8sgpt binary, located at ./bin/k8sgpt.

Run the analyze command:


k8sgpt analyze --filter Pod

Sample output:

AI Provider: AI not used; --explain not set

0: Pod default/broken-pod
- Error: Back-off pulling image "nginx:1.a.b.c"

It shows the error from pulling the image. You can also use the explain feature, which sends the analyze results to an LLM provider (OpenAI, Llama, etc.) and explains the problem in natural language.


K8sGPT supports custom AI backends, allowing us to connect it to our RAG API.

Final Architecture

The complete system ties together K8sGPT’s custom backend, the RAG API, the vector database, and the LLM provider.

Running K8sGPT with the Custom RAG Backend

Add the custom RAG backend and run an analysis:

k8sgpt auth add --backend customrag
k8sgpt analyze --filter Pod --explain --backend customrag --no-cache

Example output:

0: Pod default/broken-pod
- Error: Back-off pulling image "nginx:1.a.b.c"
- Solution:
  1. Check the image tag for typos.
  2. Use a valid version (e.g., "nginx:1.16.1").
  3. Run `kubectl set image deployment/nginx nginx=<correct_image_name>`.

How It Works

Here’s what happens under the hood:

  1. K8sGPT analyzes the cluster.
  2. As part of the explain step, it sends an HTTP request to the RAG API.
  3. The RAG API retrieves relevant context/documentation from the vector database via LangChain.
  4. OpenAI processes the prompt, combining the analyze results and relevant documentation.
  5. K8sGPT displays accurate, contextualized troubleshooting insights.
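Step 4 above is where the grounding happens: the analyze error and the retrieved documentation are combined into a single prompt. A minimal sketch of that composition, with an illustrative function name and wording (the fork's actual prompt template may differ):

```python
def build_explain_prompt(analyze_error: str, doc_chunks: list[str]) -> str:
    """Combine a K8sGPT analyze error with retrieved documentation chunks
    into a prompt that keeps the LLM grounded in official docs."""
    context = "\n---\n".join(doc_chunks)
    return (
        "You are a Kubernetes troubleshooting assistant.\n"
        f"Official documentation:\n{context}\n\n"
        f"Cluster error:\n{analyze_error}\n\n"
        "Explain the problem and propose a fix, using only the documentation above."
    )
```

Because the error text and the documentation travel together in one prompt, the model's suggestions stay anchored to sources rather than free-form recall.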

Next Steps & Future Improvements

This system is already a powerful Kubernetes troubleshooting tool, but here’s how it can be improved:

  1. Deploy K8sGPT as a Kubernetes Operator to provide continuous monitoring and proactive alerts.
  2. Self-host an open-source LLM to reduce API costs and improve data privacy.
  3. Measure accuracy to minimize hallucinations and validate recommendations.
  4. Enable auto-remediation by integrating K8sGPT with Kubernetes controllers for self-healing clusters.
  5. Adopt the Model Context Protocol to standardize LLM context-sharing across tools.
  6. Package the solution as a SaaS product for small businesses with limited access to DevOps and SRE teams.

Conclusion

We’ve built a fully functional RAG system that enhances Kubernetes troubleshooting by combining:

✅ Kubernetes documentation embeddings
✅ A REST API powered by LangChain
✅ K8sGPT’s diagnostic capabilities

This combination makes Kubernetes issue resolution faster, more accurate, and grounded in official knowledge. With further development, this can evolve into a production-ready tool for SREs and platform engineers.

By Elieser Pereira

SRE Studio Engineer at Qubika

Elieser is an SRE Studio Engineer at Qubika. With a background in software development and DevOps, he focuses on building reliable, scalable systems using tools like Go and Kubernetes. He is a Kubernetes contributor and part of the 1.33 Kubernetes Release team. He has recently been working on machine learning and cloud-related projects, with a strong interest in the intersection of math and computer science. Passionate about automation, CI/CD pipelines, and technologies that empower developers to ship better software, Elieser is always experimenting with new open source projects to sharpen his technical skills.
