Observability Toolkit

The problem

Monitoring is easy to demo and hard to trust. A polished Grafana dashboard, however many hours went into it, does not prove that anyone gets paged when something breaks, or that the page is worth acting on.

What I built

Observability Toolkit is a runnable demo you can spin up with Docker Compose on a laptop. No real database, queue, or production app required.

A custom Prometheus collector written in Go simulates the application metrics that often matter before CPU alerts do: database connection pool pressure, queue backlog, and cache hit rate. Prometheus scrapes those numbers on a schedule, Grafana charts them, and alert rules fire when the numbers miss Service Level Objective (SLO)-style health targets.
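To make the scraped output concrete, here is a minimal sketch of what a simulated collector hands to Prometheus. It is written in Python with only the standard library, purely to illustrate the text exposition format; the repo's collector is in Go. The metric names, help strings, and value ranges below are hypothetical and are not taken from the repo.

```python
import random

# Hypothetical metric names and help strings; the repo's real names may differ.
METRICS = {
    "demo_db_pool_in_use_ratio": "Fraction of database connections currently in use.",
    "demo_queue_backlog": "Number of jobs waiting in the queue.",
    "demo_cache_hit_ratio": "Fraction of cache lookups served from cache.",
}


def simulate(rng):
    """Return one simulated sample per metric (value ranges are made up)."""
    return {
        "demo_db_pool_in_use_ratio": round(rng.uniform(0.2, 0.95), 3),
        "demo_queue_backlog": float(rng.randint(0, 250)),
        "demo_cache_hit_ratio": round(rng.uniform(0.6, 0.99), 3),
    }


def render(samples):
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in samples.items():
        lines.append(f"# HELP {name} {METRICS[name]}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A seeded generator keeps the demo output repeatable between runs.
    print(render(simulate(random.Random(42))), end="")
```

Each metric becomes a HELP line, a TYPE line, and a sample line. All three signals are gauges because they can go down as well as up.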

Python chaos scripts then deliberately break things: spike load, kill the exporter, stress resources. You can watch the full chain: something fails, a metric moves, a rule breaches, an alert fires. That is how you prove monitoring works end to end, not just that the dashboard looks good in a screenshot.
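The last link in that chain, confirming that an alert actually fired after a failure was injected, can be sketched as a small polling check. This is an illustration under assumptions, not the repo's chaos script: it parses the JSON shape returned by the Prometheus HTTP API endpoint /api/v1/alerts, and the alert name, timeouts, and the injected fetch callable are all hypothetical.

```python
import json
import time


def firing_alerts(payload):
    """Return the names of alerts in the 'firing' state from an /api/v1/alerts body."""
    doc = json.loads(payload)
    alerts = doc.get("data", {}).get("alerts", [])
    return {a["labels"]["alertname"] for a in alerts if a.get("state") == "firing"}


def wait_for_alert(name, fetch, timeout_s=120.0, interval_s=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until the named alert is firing.

    Returns True once the alert fires, or False if timeout_s elapses first.
    fetch, sleep, and clock are injectable so the check can be tested offline.
    """
    deadline = clock() + timeout_s
    while True:
        if name in firing_alerts(fetch()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a real run, fetch would be a function that reads the response body from the Prometheus server's /api/v1/alerts endpoint, called right after the chaos step. A False result is the interesting outcome: something broke and nobody would have been paged.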

Use it when learning observability, prototyping alert rules before production, or checking that on-call will actually get paged for the failures you care about.

Connection to my day job

Production work included working on the Edge Observability with Prometheus Cloudflare exporter and tightening alert quality. This repo packages those lessons into something portable: pull-based metrics, SLO-style recording rules, and deliberate failure injection to check that alerts carry signal rather than noise.

What I learned

Alert design is product design. Label choices and recovery alerts matter as much as exporter code. If you claim monitoring works, you should break things on purpose and watch the alerts fire.

Infrastructure metrics (CPU, memory, disk) are lagging indicators. A growing queue backlog means work is piling up before CPU looks busy. A nearly full connection pool means requests wait even when servers look idle. A dropping cache hit rate pushes more load to the database. Those application signals often warn you earlier than infrastructure charts do, so define SLO-style targets on them first. “Queue under 100” is actionable; “queue looks high” is not.
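A target like “Queue under 100” is actionable because it can be written as a predicate and evaluated mechanically. The sketch below is a simplified model, not Prometheus's rule engine: it counts consecutive breaching samples to approximate the time-based "for" clause of an alerting rule. The Target type, the three-sample window, and the 0.8 cache threshold are assumptions for illustration; only the queue limit of 100 comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Target:
    name: str
    healthy: Callable[[float], bool]  # True when a sample meets the target
    for_samples: int                  # consecutive breaches required before firing


def states(target: Target, samples: List[float]) -> List[str]:
    """Return 'ok', 'pending', or 'firing' for each sample, in order."""
    out, streak = [], 0
    for value in samples:
        # A healthy sample resets the streak, so a brief blip never fires.
        streak = 0 if target.healthy(value) else streak + 1
        if streak == 0:
            out.append("ok")
        elif streak < target.for_samples:
            out.append("pending")
        else:
            out.append("firing")
    return out


# "Queue under 100" from the text; the window of 3 samples is hypothetical.
queue = Target("queue_backlog", lambda v: v < 100, for_samples=3)
# Hypothetical cache target: hit ratio of at least 0.8, fires immediately.
cache = Target("cache_hit_ratio", lambda v: v >= 0.8, for_samples=1)

if __name__ == "__main__":
    print(states(queue, [40, 120, 130, 90, 110, 150, 170]))
    print(states(cache, [0.9, 0.7]))
```

The pending state is what separates signal from noise here: a single spike above 100 never pages anyone, while a sustained backlog does.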

Prometheus's pull-based scraping keeps apps simple: the app exposes a /metrics endpoint and Prometheus comes to it. That separation makes the collector easy to test in isolation before wiring it into a bigger stack.
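Testing in isolation can look roughly like the following: start the endpoint on a free local port, fetch /metrics once the way a scraper would, and inspect the body, with no Prometheus server involved. This is a standard-library Python sketch rather than the repo's Go collector, and the metric name and the read_backlog callable are hypothetical.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(read_backlog):
    """Build a handler that serves one gauge whose value comes from read_backlog()."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = (
                "# TYPE demo_queue_backlog gauge\n"
                f"demo_queue_backlog {read_backlog()}\n"
            ).encode("utf-8")
            self.send_response(200)
            # Content type used by the Prometheus text exposition format.
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # silence per-request logging during tests

    return MetricsHandler


def scrape_once(read_backlog):
    """Serve /metrics on a free loopback port, fetch it once, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), make_handler(read_backlog))
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

Because the app only answers GET requests, the test needs nothing beyond an HTTP client. The same endpoint can later be listed as a scrape target without changing the app.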

Repo

Full source and design notes are on GitHub.