Affected
Operational from 10:02 PM to 10:02 PM, Major outage from 10:02 PM to 5:18 AM
- Postmortem
Postmortem: climateai.org PDS outage
Summary
The climateai.org PDS went down after the Caddy reverse proxy was repeatedly killed by the Linux OOM killer. Caddy memory usage grew to roughly 2.8–2.9 GB RSS, exhausting available VM memory and causing the host to become unstable.
Root Cause
Historically, before /tls-check was added, Caddy on-demand TLS was able to issue certificates for invalid or nested subdomains. This resulted in a large number of stale certificates being stored in Caddy’s certificate storage.
Although /tls-check has been enabled for quite some time and now correctly rejects invalid nested domains, the certificates issued before that protection existed remained in Caddy’s storage.
During Caddy certificate maintenance or renewal activity, Caddy processed this large stale certificate store. That caused memory usage to spike high enough for the kernel to OOM-kill the Caddy process.
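For illustration, the kind of decision a /tls-check endpoint makes can be sketched as below. Caddy's on-demand TLS "ask" mechanism sends the requested hostname as a domain query parameter and only proceeds with issuance on a 2xx response. This is a minimal sketch and not the actual climateai.org implementation: the names BASE_DOMAIN, FIXED_HOSTS, is_allowed_domain and tls_check_status are hypothetical, and the rule "apex, auth, or exactly one label under the apex" is inferred from the list of certificates kept during remediation.

```python
import re

# Hypothetical configuration for illustration; the real service may differ.
BASE_DOMAIN = "climateai.org"
FIXED_HOSTS = {BASE_DOMAIN, f"auth.{BASE_DOMAIN}"}

# A single DNS label: letters, digits, hyphens, no leading or trailing hyphen.
LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_allowed_domain(domain: str) -> bool:
    """Return True for the apex, the auth host, or a one-label subdomain."""
    host = domain.strip().lower().rstrip(".")
    if host in FIXED_HOSTS:
        return True
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        return False
    # Whatever sits in front of ".climateai.org" must be exactly one label,
    # so nested names such as "a.b.climateai.org" are rejected.
    label = host[: -len(suffix)]
    return LABEL_RE.fullmatch(label) is not None


def tls_check_status(domain: str) -> int:
    """Map the decision to an HTTP status: 2xx allows issuance, else deny."""
    return 200 if is_allowed_domain(domain) else 403
```

A check like this only gates new issuance. It does nothing about certificates already in storage, which is why the stale entries survived after the endpoint was enabled.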
Impact
Public HTTPS access to the PDS became unavailable.
SSH access also became unreliable while the VM was under memory pressure.
PDS account data, DIDs, repositories, records, and handles were not deleted or corrupted.
The issue was limited to Caddy/TLS handling and VM memory exhaustion.
Why it kept restarting
The Caddy container was configured with Docker’s unless-stopped restart policy. After each OOM kill, Docker restarted Caddy automatically. Because the stale certificate storage was still present, Caddy kept hitting the same memory pressure and was killed again. This created a repeated restart/OOM loop and eventually left Docker/containerd in a noisy cleanup state.
Remediation
Precautions have now been taken to prevent recurrence:
Removed stale invalid/nested-domain certificate entries from Caddy storage.
Kept valid certificates for:
climateai.org
auth.climateai.org
valid one-label handle subdomains
Confirmed current /tls-check rejects invalid nested domains.
Added or started adding a memory limit for the Caddy container so it cannot consume enough RAM to destabilize the full VM.
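As a hedged illustration of the storage cleanup, the dry-run sketch below lists certificate directories that fall outside the kept set. It assumes Caddy's default file storage layout, certificates/<issuer>/<domain>/, and the helper names is_stale and find_stale_cert_dirs are hypothetical. It is not the script used during the incident, and it only reports candidates for review rather than deleting anything.

```python
from pathlib import Path

BASE_DOMAIN = "climateai.org"
# Hosts that are always kept, per the remediation list.
KEEP = {BASE_DOMAIN, "auth." + BASE_DOMAIN}


def is_stale(name: str) -> bool:
    """Flag names that are not the apex, auth, or a one-label subdomain."""
    host = name.lower()
    if host in KEEP:
        return False
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        return True
    label = host[: -len(suffix)]
    # An empty prefix or a prefix containing a dot means a malformed
    # or nested name.
    return label == "" or "." in label


def find_stale_cert_dirs(cert_root: Path) -> list[Path]:
    """Dry run: return per-domain directories that look stale."""
    stale = []
    # Assumed layout: <cert_root>/<issuer>/<domain>/
    for issuer_dir in sorted(p for p in cert_root.iterdir() if p.is_dir()):
        for cert_dir in sorted(p for p in issuer_dir.iterdir() if p.is_dir()):
            if is_stale(cert_dir.name):
                stale.append(cert_dir)
    return stale
```

Running find_stale_cert_dirs against a copy of the storage directory first, and reviewing the output before removing anything, keeps the valid certificates out of harm's way.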
- Resolved
auth.climateai.org (auth service for climateai.org) is back up. This incident was automatically resolved by Instatus monitoring.
- Investigating
auth.climateai.org (auth service for climateai.org) is down at the moment. This incident was automatically created by Instatus monitoring.
