April 12, 2026

How the Garden Server Came Back

A made-up incident report about a tiny homelab, a mistimed update, and the joy of a careful recovery.

The failure started politely.

At 2:14 a.m., the garden server stopped responding to pings, but it kept blinking like nothing was wrong. That combination — silence with confidence — is how infrastructure gets dramatic.

First clues

We found (a quick check for each is sketched after the list):

  • healthy power
  • broken DNS cache
  • one update job that had finished almost successfully
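In hindsight, each clue had a one-line check. A minimal sketch, assuming a systemd-based box; the apt log path is an assumption, so substitute your package manager's own log:

systemctl --failed                             # anything that died quietly overnight
journalctl -u unbound --since "2 hours ago"    # what the resolver was complaining about
resolvectl statistics                          # cache hit/miss counts; a cold or broken cache shows up here
tail -n 50 /var/log/apt/history.log            # what that update job actually did (apt assumed)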

The most useful note on the incident page was just this sentence:

If the machine looks calm and the system looks chaotic, trust the system.

Recovery path

  1. Reboot the resolver.
  2. Confirm local routes.
  3. Flush the stale caches and let them rebuild.
  4. Test one dependency at a time (the whole path is sketched in commands below).
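Here is that path as shell, a sketch rather than our literal transcript. It assumes unbound fronted by systemd-resolved (which the resolvectl commands elsewhere suggest); the gateway address and the /healthz endpoint are placeholders:

# 1. Restart the resolver; unbound-control reload would also flush its cache.
sudo systemctl restart unbound

# 2. Confirm local routes before suspecting anything upstream.
ip route show
ping -c 3 192.0.2.1    # placeholder gateway; substitute yours

# 3. Flush the stale caches on the systemd-resolved side.
sudo resolvectl flush-caches

# 4. Test one dependency at a time, nearest first.
resolvectl query api.internal
curl -sS --max-time 5 http://api.internal/healthz    # /healthz is an assumed endpoint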

Commands we actually wished we had sooner

systemctl status unbound                          # is the resolver even running?
journalctl -u unbound --since "30 minutes ago"    # what it was doing when things went quiet
resolvectl query api.internal                     # can we resolve the one name we actually care about?

The part nobody remembers to write down

After the service returned, we added a tiny runbook:

  • where logs live
  • which restart is safe
  • which restart is not safe
  • who to message when the failure smells unfamiliar

That fourth bullet saves more time than most tooling.
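For the curious, the whole thing fits on a sticky note. A sketch with placeholder values, not our literal runbook:

  • logs: journalctl -u unbound, plus /var/log/ for the update trail
  • safe restart: systemctl restart unbound
  • unsafe restart: a full reboot while an update job is mid-flight
  • escalation: a named person, not a channel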

A comparison table

Option                     Speed     Confidence  Comment
Reboot everything          Fast      Low         Feels good, teaches little
Trace one layer at a time  Medium    High        Slower start, better finish
Wait and hope              Variable  None        Famous last strategy
Closing thought

I like incidents that leave the team gentler with each other and stricter with the system. The best recoveries fix both.