It’s official – Skype blames the outage on Microsoft (indirectly)

Well, the official word is out from Skype, and it can be summarized as follows: the reboots from Microsoft patches triggered a previously undetected condition and crashed their network.

Skype PR staffer Villu Arak writes in “What happened on August 16”:

On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.

The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.

Okay… I can buy that this type of thing could trigger some kind of chain reaction, but I don’t understand why this month was different from any other month. For... what? Two or three years now (more?), Microsoft patches have been coming out like clockwork on the second Tuesday of each month. Each second Tuesday or Wednesday, the millions of computers set to auto-update do so. All those zillions of computers restart automatically. Each and every month. What was so special about this August that was different from every other month? Was the number of restarts in a short period of time really that much different from other months? Why? Is the issue that there are so many more Windows Skype users than in previous months and years? Was this just the so-called “tipping point” when there were enough Windows Skype users that the normal restarts triggered this chain reaction?

The issue has now been identified explicitly within Skype. We can confirm categorically that no malicious activities were attributed or that our users’ security was not, at any point, at risk.

In other words, it was not a DDoS by Russian hackers, as one rumor had it (which had actually already been dismissed by every security researcher who looked at the alleged exploit code).

This disruption was unprecedented in terms of its impact and scope. We would like to point out that very few technologies or communications networks today are guaranteed to operate without interruptions.

Fair enough statement – if you are looking at data or web technologies… but the PSTN, to which Skype would seem to like to be compared, is designed to operate without interruptions (or with as few as possible). You know, there is this wee little market for “carrier-grade” equipment/software/etc. that is designed to be highly available without downtime. If a carrier’s network were down for over 48 hours, there would be a zillion lawsuits, intense government inquiries and more. The carriers that make up what we call the “PSTN” put an incredible effort into ensuring availability. If Skype wants to play in that game, they have to be ready to play at the same level.

Skype has now identified and already introduced a number of improvements to its software to ensure that our users will not be similarly affected in the unlikely possibility of this combination of events recurring.

Good. We would expect that.

I appreciate that Skype has been as communicative as they have through their blog and heartbeat site.  Thank you, Skype, for communicating – and leaving the comments open.  However, to me the information provided today is still lacking one key piece:

Why were the mass restarts associated with the August 2007 Microsoft updates different from the mass restarts associated with any other month’s Microsoft updates?


2 thoughts on “It’s official – Skype blames the outage on Microsoft (indirectly)”

  1. GD

    I think the answer lies in the part of the blog post not mentioned here:
    “Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly”
    Microsoft isn’t really at fault for being the catalyst by which X number of users rebooted simultaneously. If enough Skype users on Windows had rebooted independently of the system update, the error would have occurred as well. I also suspect that, because of the nature of peer-to-peer, once the problem had started it probably prompted other Skype clients to reboot, making it worse, with no way for Skype to control it short of shutting down and letting it play out.

  2. Paul Smith

    I don’t really understand why people are saying that Skype haven’t explained things. To me, it seems clear: there was a bug, they’ve admitted that, and they haven’t tried to pass the buck or anything.
    Maybe there was one extra user this month, or one server was down for maintenance or something, and everything should have worked fine, but there was a bug, so it didn’t. Also, I doubt the server code is static, so there may have been a minor change from last month which worked OK normally, but failed under excessive load.
    Do Skype really have to give us that much detail? OK, techies have a natural desire to know the intricate details, but there’s no reason we have to be given them.
    By the way, I bet when the PSTN network was 4 or 5 years old there were times when trunks were down for a day or so… When/if Skype gets to be 100+ years old, then I’d expect problems like this to be a thing of history!
