I was in charge of managing a dedicated server running Debian 7. The server hosted multiple websites with email service, as well as multiple instances of a critical web application, for a client running a business across different regions.
That day was a very important one: the client was expecting a client of his own to visit, and wanted to demo the application and show how they manage some of their business processes.
During that event, I got a phone call reporting that the client's users could not access the web application. I took the request as usual and started checking the issue. A few seconds later I got another call about other users who could not access their mailboxes. It was then that I realized something very nasty was happening and that I was in serious trouble.
I quickly figured out that I had made the worst mistake ever!
That day I had been performing the usual maintenance tasks on the server, freeing some disk space here and there. At some point, however, I deleted critical files belonging to different services such as PostgreSQL, MySQL, the mail server, etc.
I didn't notice anything until I started receiving issue reports from the clients.
It was catastrophic on all fronts.
We lost three months of data, because the backups resided in a single place, the same one where I launched the deletion, and there were no other copies. Many services were surviving on what was left in RAM (I guess), and any respawned process was lethal for the corresponding service.
In the field, the client was badly embarrassed in front of his own client, as he was cut off at the beginning of the demo.
Users started processing customer data manually, with pen and paper.
Recovering the server was a "mission: impossible". We would have needed to reinstall everything from scratch, so it was decided that the fastest recovery would be to migrate to another server running the latest Debian version.
So I installed all the required services on the new server and restored the most recent backup. The business application was finally live again.
There are a lot of lessons to learn from this disaster.
Lessons learned
- The best recovery strategy would be to have a backup server ready for swift recovery in case of disaster.
- Keep copies of backups in different locations.
- Choose the proper time frame for maintenance and warn the users beforehand.
- Use existing tools for maintenance and cleaning up space on your server.
- Always check that you are in the right directory (pwd command) before executing rm -r.
- Never use a directory other than /home for arbitrary file handling. Let the proper tools like apt-get do the work for you.
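The last two lessons can be turned into a small guard around recursive deletion. What follows is only a minimal sketch in Python, not what I ran on the server: the function name `safe_remove` and its `allowed_root` and `dry_run` parameters are made up for illustration. The idea is to resolve the target to an absolute path first (so a wrong working directory or a symlink cannot fool you), refuse anything that is not strictly inside the directory you meant to clean, and default to a dry run that only lists what would go.

```python
import shutil
from pathlib import Path


def safe_remove(target, allowed_root, dry_run=True):
    """Remove `target` only if it lies strictly inside `allowed_root`.

    Hypothetical helper for illustration. Returns the list of paths that
    were (or, in a dry run, would be) removed.
    """
    # Resolve both paths so relative paths and symlinks are expanded
    # before any comparison is made.
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()

    # Refuse the root itself and anything that is not below it.
    if path == root or root not in path.parents:
        raise ValueError(f"refusing to delete {path}: not inside {root}")

    if not path.exists():
        return []

    # Build the list of affected paths so it can be reviewed first.
    affected = [str(path)]
    if path.is_dir():
        affected += [str(p) for p in sorted(path.rglob("*"))]

    if dry_run:
        return affected

    if path.is_dir():
        shutil.rmtree(path)
    else:
        path.unlink()
    return affected
```

A typical use would be to call `safe_remove("/home/me/old_dumps", "/home/me")` once (paths here are examples), read the returned list, and only then call it again with `dry_run=False`. Because the target is resolved first, a symlink inside the allowed directory that points at a service's data directory is rejected rather than followed.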
What do you think?