A self-healing server

29 Aug 2007

I just deployed the riskiest piece of code on Sampa to date. It’s a self-diagnostic app for our servers that tries to recover in case of “issues”. I should be cheering and every customer should be jumping for joy that the service will be more reliable, so why am I so concerned, you might ask.

Thank you for asking.

My concerns come from seeing these self-diagnostic and self-recovery tools backfire more often than they help. Take a big datacenter with a latest-generation power generator. How many times have you heard the story that the first time the generator was needed, it failed? How many stories of backups that didn’t back up, and failed when you needed them? Or, much worse, the time you tried to restore a single piece of data and it erased everything by mistake?

If you have worked in a datacenter you have heard even more “stupid” stories of Cisco routers or load balancers doing crazy things to re-route traffic because they thought a server or router was down, when in fact it was a false positive. Or when Skype was offline for hours last week because their “recovering system” had a bug. Shit like this happens all the time.

But last Friday we had a “bit scary” moment. One of our servers became inaccessible and no site hosted on that server could be reached. It took about 2 hours to find out the server was down. Once we knew, I took a quick peek, diagnosed it as a network issue, and rebooted the server, and everything went back to normal.

But what if it had happened at 11PM and we had only noticed the next morning? What if it had taken me 4 hours to figure out the issue instead of just a few minutes?

This is where self-diagnosis and self-healing are quite powerful, but they must be done very carefully to avoid false positives.
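One common way to cut down on false positives is to refuse to act on a single failed check and only call a server “down” after several failures in a row. Here is a minimal sketch of that idea. It is not the code running on Sampa: the `FailureDetector` class, the `probe` callable and the default threshold of 3 are all made up for illustration.

```python
from typing import Callable


class FailureDetector:
    """Treats a target as down only after `threshold` consecutive failed probes.

    Hypothetical helper for illustration; `probe` is any callable that
    returns True when the target looks healthy and False otherwise.
    """

    def __init__(self, probe: Callable[[], bool], threshold: int = 3) -> None:
        self.probe = probe
        self.threshold = threshold
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run one probe; return True only when the target should be treated as down."""
        if self.probe():
            # A single success resets the streak, so one blip never triggers recovery.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

The trade-off is detection speed: with a threshold of 3 and one probe per minute, a real outage takes about three minutes to be recognized, which is still a lot better than two hours.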

My experience with MSN Search downtime is that 90–95% of all downtime during a year occurs because of network issues, including DNS misconfiguration, router/switch issues, a top-tier Internet traffic router going down, scheduled upgrades/maintenance that go awry, etc. Quickly diagnosing the issue is the number 1 priority for anyone maintaining servers. The next step is to quickly fix the issue.

So, I’m building knowledge of issues like this, and with time we will improve what we diagnose, how the service reports problems back to us, and which problems we can fix with an automated action, like restarting a service or rebooting the server.
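To make the “automated action” part less scary, one option is to escalate in steps and cap how often the drastic step can run. The sketch below assumes a policy of my own invention, not what is deployed: try restarting the service first, allow at most a fixed number of reboots per time window, and hand off to a human after that. The `RecoveryPolicy` name, the action strings and the default limits are all hypothetical.

```python
import time
from typing import Callable, List


class RecoveryPolicy:
    """Picks the least drastic recovery action and rate-limits reboots.

    Hypothetical policy for illustration: restart first, reboot at most
    `max_reboots` times per `window_seconds`, then ask a human for help.
    """

    def __init__(
        self,
        max_reboots: int = 1,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_reboots = max_reboots
        self.window_seconds = window_seconds
        self.clock = clock
        self.reboot_times: List[float] = []

    def next_action(self, restart_already_tried: bool) -> str:
        """Return 'restart_service', 'reboot_server' or 'alert_human'."""
        now = self.clock()
        # Forget reboots that fall outside the rate-limit window.
        self.reboot_times = [
            t for t in self.reboot_times if now - t < self.window_seconds
        ]
        if not restart_already_tried:
            return "restart_service"
        if len(self.reboot_times) < self.max_reboots:
            self.reboot_times.append(now)
            return "reboot_server"
        # Reboot budget is spent: stop acting automatically.
        return "alert_human"
```

The cap matters because a recovery tool that reboots in a loop on a false positive is exactly the kind of backfire described above. Once the budget is spent, the tool stops touching the server and only reports.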


Marcelo Calbucci
