04 · Infrastructure · 2024

AlmaLinux Disaster Recovery

Boot-level recovery and hardening for a mission-critical AlmaLinux production server.

Context: Production recovery, VM ops
Focus: Boot repair, safe restore, hardening
Stack: AlmaLinux · Linux · GRUB · chroot · Rsync
Outcome: Service restored + runbook

What happened

A boot-level failure.

The server hit a boot-level failure that prevented normal startup. Recovery required taking control of the machine state, preserving evidence, restoring service safely, and then hardening the setup to reduce recurrence.

  • Stabilize first: stop automation that makes the failure worse.
  • Preserve evidence: logs and disk state before invasive repair steps.
  • Restore service safely: avoid “fixes” that create silent corruption.
  • Harden immediately after: write the runbook while it’s fresh.
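
As an illustration of the "preserve evidence" step, here is a minimal sketch that copies log files into a timestamped directory and records their checksums before any repair touches the disk. The function names, the directory layout, and the `SHA256SUMS` file name are assumptions made for this example, not the tooling used in the incident.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_evidence(sources: list[Path], dest_root: Path) -> Path:
    """Copy each existing source file into a new timestamped directory
    and write a SHA256SUMS file next to the copies.

    Hypothetical helper: paths and naming are illustrative only.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = dest_root / f"evidence-{stamp}"
    dest.mkdir(parents=True, exist_ok=False)
    lines = []
    for index, src in enumerate(sources):
        if not src.is_file():
            continue  # skip missing files rather than abort the capture
        # Prefix with the index so files sharing a name do not collide.
        target = dest / f"{index:03d}-{src.name}"
        shutil.copy2(src, target)  # copy2 also preserves timestamps
        lines.append(f"{sha256_of(target)}  {target.name}")
    (dest / "SHA256SUMS").write_text("\n".join(lines) + "\n")
    return dest
```

A call such as `preserve_evidence([Path("/mnt/sysroot/var/log/messages")], Path("/mnt/usb"))` would be typical from a rescue shell; both paths are placeholders. The checksum file uses the two-space layout that `sha256sum -c` reads, so the copies can be verified later without this script.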

Playbook

Correctness over improvisation.

  • Boot into a recovery environment (rescue mode / console access).
  • Confirm disk/filesystem health before mounting read-write.
  • Repair the boot chain as needed (GRUB, initramfs, configs).
  • Validate services and data directories, then bring traffic back gradually.
  • Document the incident timeline and convert the steps into a runbook.
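
The boot-repair steps above can be sketched as a dry-run plan: the script below only builds and prints the commands, so they can be reviewed before anyone runs them. It assumes a BIOS-booted system with an XFS or ext4 root; the device names, the mount point, and the `RescueTarget` class are hypothetical. On UEFI systems `grub2-install` is generally not the right tool, and the plan would need to change.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RescueTarget:
    """Hypothetical layout; substitute the real devices before use."""
    root_dev: str = "/dev/sda2"        # device holding the root filesystem
    boot_disk: str = "/dev/sda"        # disk that carries the BIOS boot code
    fstype: str = "xfs"                # "xfs" or "ext4"
    mountpoint: str = "/mnt/sysroot"   # where the rescue shell mounts root
    boot_dev: Optional[str] = None     # separate /boot partition, if any


def check_command(target: RescueTarget) -> list[str]:
    """Filesystem check in no-modify mode, run before mounting."""
    if target.fstype == "xfs":
        return ["xfs_repair", "-n", target.root_dev]
    if target.fstype == "ext4":
        return ["e2fsck", "-f", "-n", target.root_dev]
    raise ValueError(f"unsupported filesystem: {target.fstype}")


def build_plan(target: RescueTarget) -> list[list[str]]:
    """Return the ordered commands: check, mount, then repair in a chroot."""
    mnt = target.mountpoint
    plan = [check_command(target), ["mount", target.root_dev, mnt]]
    if target.boot_dev:
        plan.append(["mount", target.boot_dev, f"{mnt}/boot"])
    # The chroot needs the kernel pseudo-filesystems for GRUB and dracut.
    for pseudo in ("dev", "proc", "sys"):
        plan.append(["mount", "--bind", f"/{pseudo}", f"{mnt}/{pseudo}"])
    plan.append(["chroot", mnt, "grub2-install", target.boot_disk])
    plan.append(["chroot", mnt, "grub2-mkconfig", "-o", "/boot/grub2/grub.cfg"])
    plan.append(["chroot", mnt, "dracut", "--force", "--regenerate-all"])
    return plan


def render(plan: list[list[str]]) -> str:
    """Format the plan as shell-quoted lines for review or a runbook."""
    return "\n".join(shlex.join(cmd) for cmd in plan)


if __name__ == "__main__":
    print(render(build_plan(RescueTarget())))
```

Keeping the plan as data rather than executing it directly fits the "correctness over improvisation" rule: the same output can be pasted into the incident timeline and later into the runbook.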

Hardening

  • Backups with restore drills, not just “backup succeeded”.
  • Explicit alerts for disk pressure and failed backup jobs.
  • Configuration tracking, so “what changed?” is answerable quickly.
  • A clear “stop the bleeding” policy during incidents.
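
The restore-drill, configuration-tracking, and disk-pressure items above can share one small mechanism: a checksum manifest of a directory tree, plus a free-space check. The sketch below is one possible shape for that, not the tooling used on this server; every function name and the 85% threshold are assumptions chosen for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def manifest(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its SHA-256."""
    result = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        result[path.relative_to(root).as_posix()] = digest.hexdigest()
    return result


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Answer "what changed?" between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def restore_drill_ok(source: Path, restored: Path) -> bool:
    """A drill passes only if the restored tree matches the source exactly."""
    delta = diff_manifests(manifest(source), manifest(restored))
    return not any(delta.values())


def disk_pressure(path: str, max_used_fraction: float = 0.85) -> bool:
    """True when the filesystem holding path is fuller than the threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > max_used_fraction
```

For configuration tracking, saving `manifest(Path("/etc"))` on a schedule and diffing consecutive snapshots shows which files changed between two points in time. For a restore drill, `restore_drill_ok` compares live data against a copy restored from backup into a scratch directory, which tests the restore path rather than only the backup job's exit status.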