I've been reading incident reports and postmortems. The same five cron jobs show up over and over. Not exotic edge cases — the boring, critical jobs that everyone sets up, everyone forgets about, and everyone regrets when they fail.
Every company has one. 0 2 * * * /backup.sh. It ran fine for 14 months, then the disk filled up and the backup script started writing zero-byte files. Exit code 0. No error. Three weeks later someone needs to restore and discovers the last good backup is from November.
Why it breaks: Backup scripts almost never validate their own output. They run, they finish, they exit clean. The failure is semantic, not syntactic — the script "succeeded" at producing garbage.
What to do: Ping your monitoring only after verifying the output exists and is non-trivial:
#!/bin/bash
set -euo pipefail
# Compute the filename once — calling date three times risks a midnight rollover
BACKUP="/backups/db-$(date +%Y%m%d).sql.gz"
pg_dump mydb | gzip > "$BACKUP"
# stat -c%s is GNU/Linux; stat -f%z is the BSD/macOS fallback
FILESIZE=$(stat -c%s "$BACKUP" 2>/dev/null || stat -f%z "$BACKUP")
[ "$FILESIZE" -gt 1000 ] && curl -fsS https://ping.trebben.dk/YOUR_SLUG
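A size check catches zero-byte files, but a truncated archive can still pass it. If you want to go one step further, `gzip -t` verifies the archive is structurally valid. A minimal sketch — the /tmp path and dummy data are stand-ins for the real pg_dump output:

```shell
#!/bin/bash
# Stand-in for the real backup: a small gzipped file in /tmp
BACKUP="/tmp/demo-backup.sql.gz"
printf 'CREATE TABLE demo (id int);\n' | gzip > "$BACKUP"

# gzip -t decompresses in memory and fails on a truncated or corrupt file
if gzip -t "$BACKUP" 2>/dev/null; then
  echo "archive valid"
else
  echo "archive corrupt or truncated" >&2
  exit 1
fi
```

Gate the heartbeat ping on this check the same way as the size check: ping only when `gzip -t` exits 0.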
Let's Encrypt made free TLS mainstream. It also created a new class of incident: the cert that auto-renewed on schedule, until the day it didn't. The cron job is still there. It just started failing silently when the DNS provider changed their API, or when certbot updated and the CLI flags shifted.
Why it breaks: Certificate renewal runs so rarely (every 60-90 days) that you forget it exists. When something in the environment changes — a package update, a DNS migration, a firewall rule — the renewal cron breaks and you have 30 days of silence before the cert expires and your site goes down.
What to do: Monitor the renewal job, but also monitor the cert expiry independently. Belt and suspenders:
# Monitor the renewal job
0 3 * * 1 certbot renew && curl -fsS https://ping.trebben.dk/cert-renewal
# Also: check actual cert expiry weekly (separate monitor)
# Note: crontab entries must be a single line (no backslash continuation),
# and -servername is needed so SNI hosts present the right cert.
# -checkend 604800 fails if the cert expires within 7 days.
0 9 * * 3 openssl s_client -connect yoursite.com:443 -servername yoursite.com < /dev/null 2>/dev/null | openssl x509 -noout -checkend 604800 && curl -fsS https://ping.trebben.dk/cert-check
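If the `-checkend` semantics are unfamiliar, you can exercise them locally against a throwaway self-signed cert before wiring them into cron. A sketch — the /tmp paths and the 30-day lifetime are illustrative:

```shell
#!/bin/bash
# Generate a throwaway self-signed cert valid for 30 days
CERT=/tmp/demo-cert.pem
KEY=/tmp/demo-key.pem
openssl req -x509 -newkey rsa:2048 -nodes -keyout "$KEY" -out "$CERT" \
  -days 30 -subj "/CN=demo.example" 2>/dev/null

# -checkend N exits 0 if the cert is still valid N seconds from now
openssl x509 -in "$CERT" -noout -checkend 604800 && echo "valid for 7+ days"
openssl x509 -in "$CERT" -noout -checkend 8640000 || echo "expires within 100 days"
```

Both lines print here: a 30-day cert survives the 7-day check (604800 s) and fails the 100-day check (8640000 s), which is exactly the behavior the weekly cron entry relies on.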
Log rotation, temp file cleanup, old session purging. These run fine until the day they don't, and then /tmp fills up, the database grows to 200GB, or your log partition hits 100% and takes down the application server.
Why it breaks: Cleanup jobs have a unique failure mode — when they stop running, nothing immediately breaks. The system gets slowly worse over days or weeks. By the time you notice, the damage is significant and the fix is an emergency.
What to do: These are the jobs that most benefit from heartbeat monitoring, because the failure is always slow and invisible:
# Log cleanup — monitor it because the failure is silent and slow
0 4 * * * find /var/log/app -type f -mtime +30 -delete && curl -fsS https://ping.trebben.dk/log-cleanup
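Because the failure mode here is "disk slowly fills," it can be worth making the ping confirm not just that the cleanup ran, but that the partition is actually healthy afterwards. A sketch, assuming GNU `df`; the mount point and the 90% threshold are illustrative:

```shell
#!/bin/bash
# Extract the usage percentage for the log partition (GNU coreutils df)
USAGE=$(df --output=pcent /var/log | tail -1 | tr -dc '0-9')

if [ "$USAGE" -lt 90 ]; then
  echo "disk ok (${USAGE}% used)"
else
  echo "disk nearly full (${USAGE}% used)" >&2
  exit 1
fi
```

Run this after the `find`, and gate the heartbeat ping on both: if cleanup runs but the partition keeps climbing anyway, the ping stops and you get an alert before the emergency.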
Daily revenue reports, weekly analytics emails, monthly compliance exports. Someone important depends on these, and the first you'll hear about a failure is an angry email from finance asking where the report is.
Why it breaks: Report jobs are the most dependency-heavy cron jobs. They query databases, call APIs, format data, send emails. Any one of those dependencies can fail. And because the output goes to a human, not a system, there's no automated check on the other end.
What to do: Heartbeat monitoring turns "angry email from finance" into "alert at 7:01 AM when the report didn't generate":
0 7 * * 1-5 /generate-report.sh && curl -fsS https://ping.trebben.dk/daily-report
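Since the report's consumer is a human, the script itself should validate its output before exiting 0 — otherwise the heartbeat confirms "a job ran," not "a real report went out." A minimal sketch of that pattern; the /tmp path, stand-in data, and the two-line floor are illustrative:

```shell
#!/bin/bash
# Stand-in for the real query/export step
REPORT="/tmp/daily-report.csv"
printf 'date,revenue\n2024-01-02,1234\n' > "$REPORT"

# Exit nonzero if the report is empty or lost its data rows,
# so the && curl ping in the crontab never fires
LINES=$(wc -l < "$REPORT")
if [ "$LINES" -ge 2 ]; then
  echo "report ok: $LINES lines"
else
  echo "report empty or truncated" >&2
  exit 1
fi
```

The same semantic-validation idea as the backup example: exit status should reflect the usefulness of the output, not just that the commands completed.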
The ironic one. You set up a cron job to check if your services are healthy — ping endpoints, verify disk space, test database connectivity. It's your monitoring system. And then it stops running, and you've lost your eyes.
Why it breaks: Monitoring cron jobs have the same failure modes as every other cron job. Server reboots, cron daemon crashes, OOM kills. The difference is that when this one breaks, you lose visibility into everything else.
What to do: Your monitoring monitor is the single most important cron job to put behind heartbeat monitoring. It's the one that justifies signing up for a service like this in the first place:
# The monitoring job itself pings the heartbeat service after completing its checks
*/5 * * * * /health-checks.sh && curl -fsS https://ping.trebben.dk/health-monitor
All five share the same root cause: cron has no concept of "expected success." It runs commands. Whether they work, whether they produce useful output, whether they run at all — that's not cron's problem.
Heartbeat monitoring inverts this. Instead of trying to detect every possible failure, you confirm success. When the confirmation stops, you know something broke — even if you don't yet know what.
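That inversion can be captured in one small wrapper instead of repeating `&& curl` in every crontab line. A sketch — `run_and_ping` is a hypothetical helper, and the echo stands in for the real curl call:

```shell
#!/bin/bash
# YOUR_SLUG is a placeholder; swap in the URL your monitor gives you
PING_URL="https://ping.trebben.dk/YOUR_SLUG"

# Run any command; ping only when it exits 0. Silence = something broke.
run_and_ping() {
  if "$@"; then
    # Real version: curl -fsS "$PING_URL" > /dev/null
    echo "success -> ping $PING_URL"
  else
    echo "failure -> no ping, monitor alerts" >&2
    return 1
  fi
}

run_and_ping true           # succeeds: heartbeat sent
run_and_ping false || true  # fails: silence, and the monitor notices
```

A crontab entry then becomes `0 2 * * * /usr/local/bin/run-and-ping backup /backup.sh`, and every job gets the confirm-success behavior for free.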
Built by Jeff in Denmark.