WordPress Disaster Recovery: The 5-Phase Framework

Why frameworks beat heroics

In our first year doing WordPress emergency work, we learned a brutal lesson: every emergency is different, but the response pattern that works is always the same. The teams that handle disasters well don't have smarter engineers — they have a framework that prevents them from making the obvious mistakes that compound damage.

This article describes the 5-phase framework we use on every WordPress emergency, from "the site is down" to "this won't happen again." Each phase has its own goal, its own success criteria, and its own anti-patterns.

Phase 1 — Identify

The first thing to know is what actually broke. This sounds obvious but is the most-skipped phase. Engineers under pressure start fixing before they understand what they're fixing.

Goal: characterize the incident in three sentences.

What is the user experience? (white screen, redirect to spam, 500 error, slow page)
What is the technical signature? (PHP fatal, MySQL connection refused, malware payload signature, etc.)
What is the scope? (single page, single user, site-wide, multiple sites)

Standard identification toolkit

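# What does a visitor see? (-I requests headers only: status code, redirects)
curl -I https://yoursite.com/

# What does WordPress think is happening?
# Requires the wp-cli/doctor-command package, which is not bundled with WP-CLI
wp doctor check --all

# What do the server logs say? (paths and unit names vary by distro)
tail -n 200 /var/log/nginx/error.log
tail -n 200 /var/log/php-fpm.log
journalctl -u mysql -n 200

# When did the failure start?
last -F | head -n 20   # recent logins and reboots
# Create the marker first, stamped with your last known-good time (placeholder below)
touch -d 'YYYY-MM-DD HH:MM' /tmp/marker
find /var/www/yoursite -newer /tmp/marker -type f   # files modified after that time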
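
One way to turn raw logs into the "technical signature" sentence is to count how often known failure patterns appear. The following is a minimal sketch, not a complete classifier: the function name count_signatures, the labels, and the pattern list are all illustrative assumptions and should be adjusted to whatever your stack actually logs.

```shell
#!/bin/sh
# count_signatures LOGFILE
# Prints "<count> <label>" for a few common failure signatures, most frequent first.
# The labels and patterns are assumptions for illustration, not an exhaustive list.
count_signatures() {
    log="$1"
    for entry in \
        "php_fatal:PHP Fatal error" \
        "db_connect:Error establishing a database connection|Connection refused" \
        "out_of_memory:Allowed memory size of [0-9]+ bytes exhausted" \
        "upstream_timeout:upstream timed out"
    do
        label="${entry%%:*}"      # text before the first colon
        pattern="${entry#*:}"     # text after the first colon
        # grep -c prints the number of matching lines (0 when nothing matches)
        count=$(grep -E -c "$pattern" "$log" 2>/dev/null)
        echo "${count:-0} $label"
    done | sort -rn
}
```

Running count_signatures /var/log/nginx/error.log gives a quick ranking to start from; a single line can match more than one pattern, so treat the counts as a pointer to where to read, not as a diagnosis.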

Anti-patterns

Jumping into a fix before reading any logs
Assuming the obvious cause is the real cause (deciding that a slow site "needs a cache plugin" is a coin-flip guess, not a diagnosis)
Trying to fix more than one issue simultaneously
Skipping the scope question and over-fixing

Time budget: 5–15 minutes. If you can't characterize the incident in 15 minutes, you're missing something fundamental. Start at the network layer (DNS, TLS, web server) and work up through PHP to the database.

Phase 2 — Contain

Once you know what's wrong, stop the damage from spreading. Containment is not the fix; it's the stop-bleeding measure that gives you space to do the fix properly.

Goal: limit blast radius without making things worse.

Standard containment moves

Take the site offline if it's serving malicious content. A maintenance page is better than serving malware to your customers and getting blacklisted. We use a static maintenance HTML page or temporarily redirect to a status page.
Snapshot current state immediately. Whatever you're about to change, take a backup of what it looks like right now. This is your fallback if your fix makes things worse.
Block the attack vector if it is still active. Brute force attack? Turn on Cloudflare's Under Attack mode. SQL injection probe? Block the IP at the WAF. Compromised admin user? Disable the user account.
Stop background processes that could compound the issue. Disable cron, pause scheduled backups (they would capture the broken state and may rotate out your last clean copy), and suspend deploy pipelines.
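
The snapshot step can be sketched as a small function. This is a minimal example under stated assumptions: the function name snapshot_site is made up for illustration, both directories are supplied by the caller, and the backup directory must live outside the site directory so the archive does not include itself. The database dump only runs if WP-CLI happens to be installed.

```shell
#!/bin/sh
# snapshot_site SITE_DIR BACKUP_DIR
# Archives the site files exactly as they are right now and prints the archive path.
# BACKUP_DIR is assumed to be outside SITE_DIR.
snapshot_site() {
    site_dir="$1"
    backup_dir="$2"
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    archive="$backup_dir/snapshot-$stamp.tar.gz"

    mkdir -p "$backup_dir" || return 1
    # -C keeps the paths inside the archive relative to the site root
    tar -czf "$archive" -C "$site_dir" . || return 1

    # If WP-CLI is available, dump the database next to the file archive.
    if command -v wp >/dev/null 2>&1; then
        wp db export "$backup_dir/snapshot-$stamp.sql" --path="$site_dir" \
            >/dev/null 2>&1 || echo "warning: database export failed" >&2
    fi

    echo "$archive"
}
```

A call such as snapshot_site /var/www/yoursite /root/incident-snapshots takes seconds on most sites. The timestamp in the filename keeps repeated snapshots from overwriting each other during a long incident.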

Containment for common scenarios

Scenario	Containment action
Active malware serving	Static maintenance page, block all writes
Brute force on login	Cloudflare under attack mode, rate limit
Database overload	Disable cron, kill long queries, isolate slow plugin
Hacked admin user	Reset all admin passwords, invalidate sessions, disable suspicious accounts
Plugin update broke site	Roll back the plugin, hold further updates

Anti-patterns

Skipping the snapshot because "we know what we're doing"
Hoping the issue resolves on its own
Communicating to customers before containment is complete (they'll ask questions you can't answer yet)

Time budget: 5–10 minutes.

Phase 3 — Recover

Now you fix the actual issue. This is where most articles spend all their time, but it's only one phase of five. Recovery work depends on what was broken.

Goal: restore normal operation with confidence the fix is durable.

Recovery principles

Fix the root cause, not the symptom. If 500 errors started after a plugin update, don't just deactivate the plugin and walk away — identify why the update broke it, decide whether to wait for a fix or replace the plugin.
Test in isolation before deploying. If you're patching a function, run the patched file through php -l first. If you're changing config, check syntax before reloading.
Apply changes incrementally. One change, verify, next change. Big-bang fixes have higher rollback risk.
Restore from backup if uncertain. For hacked sites where you can't be sure you've found every backdoor, a clean restore from a pre-incident backup is faster and safer than chasing every modified file.
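
The "test in isolation" and "one change at a time" principles can be combined in a small helper. This is a sketch rather than a deployment tool: apply_change is a hypothetical name, and the check command is whatever validator fits the file (php -l for PHP, nginx -t for server config, and so on).

```shell
#!/bin/sh
# apply_change TARGET CANDIDATE CHECK_CMD...
# Runs CHECK_CMD against CANDIDATE first; only if the check passes is TARGET
# replaced, and the previous version is kept as TARGET.bak for a quick revert.
apply_change() {
    target="$1"; candidate="$2"; shift 2

    # 1. Test the candidate in isolation; the live file is untouched so far.
    if ! "$@" "$candidate" >/dev/null 2>&1; then
        echo "check failed, not applied: $candidate" >&2
        return 1
    fi

    # 2. Keep the current version so this one change can be reverted on its own.
    cp -p "$target" "$target.bak" || return 1
    cp "$candidate" "$target" || return 1
    echo "applied: $target (previous version in $target.bak)"
}
```

For a patched PHP file the call would look like apply_change wp-content/plugins/example/example.php /tmp/example.patched.php php -l, where both paths are placeholders. A passing syntax check does not prove the fix works, so verify the site after each applied change before moving on to the next.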

Recovery checklists for the four most common emergencies

Plugin update broke the site

1. Identify the offending plugin (Health Check Troubleshooting, error log)
2. Roll back via WP-CLI: wp plugin install pluginname --version=X.Y.Z --force
3. Verify front-end and admin both function
4. Decide: stay on old version, or wait for fix?

Malware infection

1. Take full file + DB snapshot of compromised state (for forensics)
2. Restore from clean backup (taken before infection)
3. If no clean backup: forensic cleanup of every modified file (we use file integrity scan output)
4. Reset all credentials: admin passwords, DB password, salts, API keys
5. Scan for backdoors that survived (hidden mu-plugins, modified core files)
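
As one illustration of the backdoor scan, the sketch below lists PHP-like files under wp-content/uploads, a directory where executable code normally has no reason to exist. The function name find_upload_php and the extension list are assumptions. A hit is a lead to inspect, not proof of compromise, and an empty result does not mean the site is clean.

```shell
#!/bin/sh
# find_upload_php WP_ROOT
# Lists PHP-like files under wp-content/uploads, sorted by path.
# The extension list is illustrative; attackers also hide code in other file types.
find_upload_php() {
    wp_root="$1"
    find "$wp_root/wp-content/uploads" -type f \
        \( -iname '*.php' -o -iname '*.phtml' -o -iname '*.php[0-9]' -o -iname '*.phar' \) \
        2>/dev/null | sort
}
```

For modified core and plugin files, WP-CLI's wp core verify-checksums and wp plugin verify-checksums --all compare installed files against the checksums published on WordPress.org, which complements a directory scan like this one.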

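Database corruption

1. Stop writes immediately
2. Take a binary backup of the data directory (with the MySQL server stopped, so the copy is consistent)
3. For MyISAM tables, try mysqlcheck --repair first (the underlying REPAIR TABLE statement does not support InnoDB)
4. If InnoDB: set innodb_force_recovery=1 in my.cnf, restart, and dump whatever data is readable. Raise the value one step at a time (up to 6) only if the server still won't start or the dump fails; values of 4 and above can permanently corrupt data files
5. Restore from the dump into a fresh database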

Server resource exhaustion

1. Identify the resource (CPU, RAM, disk, connections)
2. Identify the consumer (htop, iotop, SHOW PROCESSLIST)
3. Kill or throttle the consumer
4. Add capacity if it's a legitimate workload
5. Optimize if it's a misbehaving plugin

Anti-patterns

Applying multiple fixes without verifying each
"Fix" by uninstalling everything (loses customer data)
Skipping the backup-before-restore step
Declaring victory before testing in a real browser

Time budget: 30 minutes to several hours depending on severity.

Phase 4 — Harden

The site is back up. Don't stop here. The same attack vector that caused this incident will be used again — sometimes by the same attacker, sometimes by a different one running the same automated tool.

Goal: make the same incident impossible to repeat.

Hardening based on incident type

Incident	Hardening
Brute force succeeded	Add 2FA, change /wp-login.php path, IP whitelist if possible
Plugin vulnerability exploited	Remove vulnerable plugin, audit similar plugins, add WAF rules
Stolen credentials	Reset all passwords, audit account creation events, enable activity logging
Malware via wp-content/uploads	Disable PHP execution in uploads dir at server level
Database leak	Audit query patterns, restrict DB user privileges, encrypt sensitive columns

Generic hardening that helps every incident

Update WordPress core, themes, all plugins to latest
Remove all inactive plugins and themes
Set strong unique passwords for every account
Enforce 2FA for administrator and editor roles
Configure file integrity monitoring
Move backups off-server
Implement a WAF if not already present
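
The update and cleanup items in this list map onto standard WP-CLI commands. The sketch below prints the commands by default so the plan can be reviewed before anything changes. The DRY_RUN switch and the function names run and harden_updates are illustrative assumptions, and the site path is a placeholder.

```shell
#!/bin/sh
# Post-incident update pass using WP-CLI.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# harden_updates WP_ROOT
harden_updates() {
    wp_root="$1"
    run wp core update --path="$wp_root"
    run wp plugin update --all --path="$wp_root"
    run wp theme update --all --path="$wp_root"
    # List inactive plugins and themes for review before deleting anything.
    run wp plugin list --status=inactive --field=name --path="$wp_root"
    run wp theme list --status=inactive --field=name --path="$wp_root"
}
```

Calling harden_updates /var/www/yoursite with the default settings shows the five commands without touching the site. Deleting inactive plugins and themes is left as a manual step on purpose, since some "inactive" items are parent themes or code a client expects to reactivate.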

Anti-patterns

"Quick hardening" — making changes without testing them
Adding a security plugin without configuring it
Sharing the new credentials in plaintext channels (Slack, email)

Time budget: 1–4 hours depending on the changes needed.

Phase 5 — Monitor

The final phase is the one that pays for the previous four. Without monitoring, you find out about the next incident the same way you found out about this one — too late, from a customer complaint.

Goal: detect a recurrence within minutes, not days.

The monitoring stack

Uptime — external monitor pinging every 60 seconds (UptimeRobot, BetterStack)
File integrity — alert when any file in wp-content changes outside a planned deploy
Database — alert when query patterns deviate from baseline
Error rate — alert when 5xx errors spike above baseline
Reputation — daily check against Google Safe Browsing and major blacklists
Audit log — every admin action recorded and reviewable
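
File integrity monitoring can be as simple as a hash manifest that is rebuilt and compared on a schedule. The sketch below assumes GNU coreutils (sha256sum) and uses made-up function names fim_baseline and fim_check. The manifest must be stored outside the monitored directory, ideally off-server, and filenames containing newlines or backslashes are not handled.

```shell
#!/bin/sh
# fim_baseline DIR MANIFEST  - record a SHA-256 hash for every file under DIR
# fim_check    DIR MANIFEST  - print files that changed, vanished, or appeared;
#                              returns 1 if there are any, 0 otherwise
fim_baseline() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 ) > "$2"
}

fim_check() {
    current=$(mktemp) || return 1
    fim_baseline "$1" "$current"
    # Lines present in only one of the two manifests are differences.
    # Each line is a 64-character hash plus two spaces, so the path starts at column 67.
    changed=$(sort "$2" "$current" | uniq -u | cut -c67- | sort -u)
    rm -f "$current"
    if [ -z "$changed" ]; then
        return 0
    fi
    echo "$changed"
    return 1
}
```

After a planned deploy, rerun fim_baseline to accept the new state. Between deploys, run fim_check from cron and send any output to the alert channel.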

Where alerts go

File integrity changes → Slack, instant
Uptime down → Email + SMS, instant
Blacklist appearance → Email + SMS, instant
Error rate spike → Slack
Audit log review → Weekly email summary
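
An error-rate alert needs a number to compare against the baseline. The sketch below computes the share of 5xx responses in an access log, assuming the common or combined log format where the status code is the ninth whitespace-separated field. The function names and the fixed threshold are illustrative stand-ins for a real baseline.

```shell
#!/bin/sh
# error_rate_5xx ACCESS_LOG
# Prints the percentage (rounded down) of requests answered with a 5xx status.
# Assumes common/combined log format: the status code is field 9.
error_rate_5xx() {
    awk '
        $9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
        END { if (total == 0) print 0; else printf "%d\n", errors * 100 / total }
    ' "$1"
}

# check_error_rate ACCESS_LOG THRESHOLD_PERCENT
# Returns non-zero (alert) when the 5xx share exceeds the threshold.
check_error_rate() {
    rate=$(error_rate_5xx "$1")
    if [ "$rate" -gt "$2" ]; then
        echo "ALERT: ${rate}% of requests are 5xx (threshold ${2}%)"
        return 1
    fi
    echo "ok: ${rate}% 5xx"
}
```

This measures the whole file, so in practice it would be pointed at a recent slice of the log (for example, the last few thousand lines written to a temporary file) and the alert text forwarded to Slack by whatever notifier is already in place.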

The post-incident debrief

48–72 hours after the incident is closed, run a debrief:

What happened?
How did we find out?
What did we do?
What worked? What didn't?
What will we change in our process?

Document the answers and the follow-up actions. In our experience, three incidents into a real debrief practice, a team's response time roughly halves.

When this framework breaks down

The framework assumes you have access, time, and basic competence. It breaks down when:

The hosting account itself is compromised (lockout from the hosting panel)
The attacker is still actively in the system (you need to evict them first)
The database is destroyed beyond restoration and no backups exist
Legal/regulatory implications require external counsel before action

These are the cases where you call in specialists. We handle these scenarios routinely — average time to incident closure for a compromised hosting account is 4–6 hours.

Emergency response — under 15 minutes to first contact. Malware removal and hacked website repair follow this exact framework with the technical depth required for each scenario.