Kube Pod Self-Healer

The problem

I have run into the same Kubernetes (K8s) failures both at work and in my homelab: crash loops, out-of-memory kills, failed health checks, images that will not pull. The fix usually revolves around deleting the pod and letting the cluster recreate it, scaling up, or alerting the team.

What I built

Kube Pod Self-Healer was initially a proof of concept of that workflow. A Go health agent watches pod status. When it sees a known failure pattern, it sends the event to a Python remediation service that applies bounded fixes (restart the pod, scale replicas, clear a cache) under least-privilege permissions. Terraform and Kind (Kubernetes in Docker) make the whole stack reproducible on a laptop.

Think of it as a tireless operator for boring, repetitive incidents so humans can focus on judgment calls.

Connection to my day job

At work I automate incident response, queue management, and deployment safety nets. I wanted to experiment openly with self-healing: how much automation helps before it becomes dangerous, and how to keep detection separate from action so each layer stays testable.

What I learned

The agent needs to know when pods fail without constantly hammering the Kubernetes (K8s) API. Informers (watchers that keep a local copy of cluster state) let you react to changes instead of polling every few seconds.

Automated fixes should be safe to run twice and logged every time. If “restart pod” runs on something already healthy, nothing bad should happen, and you should always be able to review what the system did and why.

Self-healing is a dial, not a switch. The goal is faster recovery on boring incidents, not removing humans from every decision.

Repo

Full source and design notes are on GitHub.