For the first time in 3 years, Skype was down today, and as I write this it is still slowly coming back online. A ton of articles were written today, nearly all pointing back to Skype’s blog post or status update, which most importantly said this (I’ve shortened it a bit):
Some of these computers are what we call ‘supernodes’ – they act a bit like phone directories for Skype. If you want to talk to someone, and your Skype app can’t find them immediately … your computer or phone will first try to find a supernode to figure out how to reach them.
Under normal circumstances, there are a large number of supernodes available. Unfortunately, today, many of them were taken offline by a problem affecting some versions of Skype. As Skype relies on being able to maintain contact with supernodes, it may appear offline for some of you.
Let’s explain this a bit more.
Explaining Supernodes
If you go back and read my primer on the technology behind Skype and P2P networks, I described supernodes as Skype clients that are on the public Internet, NOT behind a firewall or NAT device, and that broker the communication between two Skype clients. In a very simplistic view, the picture looks like this:
As I note in the update section to that post, the Skype clients acting as “supernodes”:
perform the somewhat limited functions of connecting nodes together, providing a distributed database and choosing appropriate nodes to act as “relay nodes” when necessary.
The supernodes are what connect individual Skype clients to each other and create the P2P “overlay network”… the “cloud”… that connects all Skype clients to each other.
These “supernodes” run the regular Skype software. The ONLY difference is that they are on the public Internet. So if you are running Skype on a computer – and you are NOT behind a firewall, there is a chance that your computer could become a supernode. That’s just how Skype works. So there are a lot of these supernodes out on the public Internet:
Here’s the thing… EVERY Skype client is connected out to a supernode. You have to be, in order to be connected to the larger directory of Skype users and for them to know how to reach you. (Note that Skype clients behind the same firewall may not be connected to the same supernode.) So it may look like this:
The supernodes are then connected to each other… creating Skype’s globally distributed directory database, which in a simplified form you could think of like this:
(Skype’s supernode connection algorithm is presumably more complex than the simple mesh I’m showing here… but the point is that they are connected to each other.)
Now, Skype’s picture is not exactly like this. We know from the explanations of the 2007 outage that Skype uses a hybrid architecture that involves some “authentication servers” that Skype clients connect to in order to first be granted access to the Skype P2P cloud. I’m not aware of anyone publishing technical details on exactly how those authentication servers connect into the Skype infrastructure, but let’s just say it looks something like this:
Skype clients need to connect to these authentication servers in order to validate their username and password, and presumably to validate their calling plan, how much money they have left in their account for calls, etc.
Now, the cool part about the “self-healing” aspect of the supernode architecture is that if a supernode goes down, Skype clients will simply attach to another supernode:
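To make that failover idea concrete, here is a minimal Python sketch. Skype's real supernode selection algorithm is not public, so everything here (the `Client` class, the `ensure_attached` method, the idea of a simple ordered list of known supernodes) is a made-up illustration of the general behavior, not how Skype actually does it.

```python
class Client:
    """Hypothetical Skype-like client that stays attached to one supernode."""

    def __init__(self, name, known_supernodes):
        self.name = name
        self.known = list(known_supernodes)  # supernodes this client has learned about
        self.attached = None                 # supernode currently in use, if any

    def ensure_attached(self, online):
        """Keep the current supernode if it is still online; otherwise
        fall over to the first known supernode that is online."""
        if self.attached not in online:
            self.attached = next((sn for sn in self.known if sn in online), None)
        return self.attached


# Usage: the client silently moves on when its supernode disappears.
online = {"sn1", "sn2", "sn3"}
alice = Client("alice", ["sn1", "sn2", "sn3"])
first = alice.ensure_attached(online)    # attaches to a supernode
online.discard(first)                    # that supernode goes down
second = alice.ensure_attached(online)   # client reattaches elsewhere
```

As long as at least one known supernode is still reachable, the client never notices more than a brief hiccup. The trouble starts when `online` shrinks toward empty.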
The problem with the outage today seems to be, from Skype’s explanation, that a great number of supernodes went offline, tearing apart the fabric of Skype’s P2P network overlay:
OOPS.
Something broke. We don’t know what. Skype’s blog post says only:
Unfortunately, today, many of them were taken offline by a problem affecting some versions of Skype.
What was the “problem affecting some versions of Skype”? No clue. Was it a software update that somehow affected the supernode algorithm? Did it affect the communication with clients?
No clue.
But according to Skype, that’s what happened. Hopefully they will be a bit more forthcoming soon (although perhaps NOT, given their pre-IPO status), but at the moment that’s all we have to go on.
My guess would be that there might also have been “cascading failures” in this scenario. If there was, say, a software update affecting some supernodes, as those supernodes dropped offline, the increased load of Skype clients trying to connect to online supernodes might have caused some of them to then drop offline. Or when a supernode came back online, it may have been overwhelmed by the quantity of connection requests and soon failed again. As I said, that’s purely a guess… but you could see those kinds of failures happening in a situation like this.
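Here is a toy Python model of that kind of cascade. It is purely illustrative: the capacities, the client count, and the assumption that clients spread evenly across whatever supernodes are still online are all invented for the example, and `simulate_cascade` is a hypothetical name. It says nothing about what actually happened inside Skype's network.

```python
def simulate_cascade(capacities, clients, initially_failed):
    """Toy cascade model.

    capacities:       max clients each supernode can serve (index = node id)
    clients:          total clients, assumed to spread evenly over online nodes
    initially_failed: set of node ids knocked out by the original problem

    Returns (supernodes still online, number of cascade rounds).
    """
    online = {i for i in range(len(capacities)) if i not in initially_failed}
    rounds = 0
    while online:
        load = clients / len(online)  # each surviving node's share of clients
        overloaded = {i for i in online if capacities[i] < load}
        if not overloaded:
            break                     # the network has stabilized
        online -= overloaded          # overloaded nodes drop offline too
        rounds += 1
    return len(online), rounds


capacities = [110, 120, 130, 140, 150, 160, 170, 180, 190, 200]

# Losing one small supernode is absorbed by the others.
small_loss = simulate_cascade(capacities, 1000, {0})

# Losing the two biggest supernodes pushes the rest over the edge.
big_loss = simulate_cascade(capacities, 1000, {8, 9})
```

In this toy setup, losing one node leaves the other nine running, while losing the two largest nodes takes the whole network down within three rounds. The point is only that there can be a tipping point: below it the network heals itself, above it each failure makes the next one more likely.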
Skype’s “Solution”
As a solution, Skype’s blog post says this:
What are we doing to help? Our engineers are creating new ‘mega-supernodes’ as fast as they can, which should gradually return things to normal. This may take a few hours, and we sincerely apologise for the disruption to your conversations. Some features, like group video calling, may take longer to return to normal.
No details yet on what these “mega-supernodes” are, but some speculation is that instead of relying on individual Skype client computers to “become” supernodes, Skype is going out and setting up computers/servers specifically as supernodes. Rather than rely on potentially unstable computers, Skype goes out and gets some rock solid servers under their own control and sets those up as supernodes.
Maybe that’s what a “mega-supernode” is. Maybe it’s a higher level supernode… to which “regular” supernodes connect. Again under Skype’s control… but providing a tighter core P2P network that houses the overall directory.
We don’t know yet… but those are the kinds of things Skype could be doing. Again, hopefully we’ll get more details soon… although we’ll have to see.
As I write this, my Skype client shows 4.5 million users online… it’s the beginning of the day in Europe and I’m sure folks there are trying to get online. Hopefully Skype will be getting their network back online soon.
And hopefully we’ll get some better technical explanations, too!
NOTE #1: It should be noted that there are other types of “servers” connected to the Skype P2P cloud beyond the authentication servers. There are also the servers and gateways used for SkypeOut and SkypeIn, gateways to mobile operators, web presence servers, etc. I left them out for the simplicity of the drawing.
NOTE #2: I am not an employee of Skype and do not have any inside information about the workings of Skype. The information in this article is based on what technical material Skype has made publicly available plus information a number of us have been gathering over the years. It may or may not be accurate.
Skype is never problem free — it’s garbage code. And far from a minor story as the Times is playing it here, as though it’s just out for a few hours, it’s been out all day and counting. It’s always unreliable.
I’m not convinced by this explanation, nor by Skype’s claim, because the effect was sudden. If a bad release of the software were to blame, the onset would have been gradual, over days or weeks, as super-nodes happened to be upgraded to the bad release.
My Skype client could not authenticate most of the time, or if it could authenticate, was unable to make calls even to the echo testing service. It then would drop out and be unauthenticated again fairly quickly thereafter.
This points to a potential DOS attack on Skype’s login servers: an obvious vulnerability, rather than a loss of super nodes, which are dispersed and theoretically in a reasonable surplus.
In this scenario it would make sense to (1) do dynamic rate-limited filtering to prevent the majority of repeat (attack) traffic from reaching the authentication servers (2) throw extra authentication servers into the mix on IP addresses that are new and thus not immediately subjected to attack.
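The rate-limited filtering in point (1) is commonly done with a token bucket per source address. The following Python sketch shows the general technique only; the class names, the rate and burst values, and the idea of keying on a source string are assumptions for illustration, and nothing here reflects how Skype's authentication servers are actually protected.

```python
class TokenBucket:
    """Allows short bursts, then at most `rate` requests per second."""

    def __init__(self, rate, burst):
        self.rate = rate              # tokens added per second
        self.burst = burst            # maximum tokens the bucket can hold
        self.tokens = float(burst)
        self.last = 0.0               # timestamp of the previous request

    def allow(self, now):
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerSourceLimiter:
    """Hypothetical front-end filter: one bucket per source address."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, source, now):
        if source not in self.buckets:
            self.buckets[source] = TokenBucket(self.rate, self.burst)
        return self.buckets[source].allow(now)


limiter = PerSourceLimiter(rate=1.0, burst=3)
# A source hammering the server gets only its burst allowance through.
passed = sum(limiter.allow("10.0.0.1", now=0.0) for _ in range(10))
```

A well-behaved client that logs in occasionally never runs out of tokens, while a source that repeats requests rapidly is cut off after its burst, which keeps most of the repeat traffic away from the servers behind the filter.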
That is exactly why we try to ban Skype here and at other locations: Becoming a “supernode” can take a lot of YOUR bandwidth away. You become part of Skype’s network. Well, without sharing the payments, of course 😉
I think exactly in the same way you do. And that would also explain why Skype has been so vague in explaining the problem they had.
As you describe nicely, their architecture is much more robust and redundant than most folks are saying. More robust than most communications service providers actually. Agree with you – cascading failures or a poor implementation of new code (aka not tested completely for a very important use case and/or rolled out way too aggressively/quickly). Also a pure guess – no inside info – related to a social network or video integration that they are trying to get some pre-IPO traction on and therefore moving too quickly.
I wonder if this has created any career opportunities?
My guess – no inside info – new code to support an upcoming social network or videoconferencing integration, pushed out too quickly under pre-IPO pressure. It is important to note – as you describe well – that Skype’s architecture is more redundant and resilient than that of most service providers.
Thanks for commenting… but I’ve actually had the opposite experience. I’ve been using Skype for now over 5 years, and for the most part it’s been incredibly reliable. Obviously there was the 2007 outage, and there are occasional glitches in connections, but overall the quality continues to be great.
Obviously, your experience is different.
R, I agree that Skype needs to explain more. It could be, though, that something happened to trigger the scenario… the instability could have been building for some time and then something happened that started a cascade of failures throughout the system.
I agree, though, that issues with the authentication servers could *also* be a factor here – or an explanation. Back in the 2007 outage I seem to recall the availability of those servers was a factor preventing people from logging in.
Thanks for the comment.
Keep in mind that the only way to become an actual supernode is to be out on the public Internet. If your systems are behind a firewall, they will not become a supernode. The entire point of supernodes is to connect people who are behind firewalls – and for that to work, the supernode needs to be out on the direct public Internet.
In what way?
Yes, it certainly could have been related to an update that was rushed out too quickly… at this point we’ll have to wait and hope that Skype gives more info!
Interestingly, Om Malik has a post up with an interview with Skype CEO Tony Bates:
http://gigaom.com/2010/12/23/skype-ceo-tony-bates-i-am-sorry-here-is-an-update/
and in it he includes this tidbit:
“A handful of Windows clients failed and set-off a chain reaction that brought down Skype.”
It will be interesting to learn what that is about.
Skype has been running updates recently, one of which killed skypemate on one of my XP machines. I can see how a poorly timed update could kill a network, but this time they were being quite gradual about it. As a point to point network, something piggy-backing on a data update could have caused a cascade failure. As the supernodes are unprotected by a NAT, they become vulnerable to certain types of attacks. I’ll be curious to see where the fail point was.
What was the “problem affecting some versions of Skype”? No clue.
It was probably simply that in a connect fail, some version of Skype retried excessively, thereby saturating the supernodes in much the same way as a DDOS attack.
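If excessive retrying really was involved, the usual remedy on the client side is exponential backoff with random jitter, so that clients spread their reconnect attempts out instead of hammering the surviving supernodes in lockstep. The Python sketch below shows that general technique; the function names and the base and cap values are assumptions for illustration, not anything known about Skype's client.

```python
import random


def backoff_ceiling(attempt, base=1.0, cap=60.0):
    """Upper bound on the wait before retry number `attempt` (0-based):
    doubles each time, but never exceeds `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """'Full jitter' backoff: wait a random time between 0 and the ceiling,
    so that many clients failing at once do not all retry together."""
    return rng.uniform(0, backoff_ceiling(attempt, base, cap))


# Ceilings for the first few retries: 1s, 2s, 4s, 8s, ... capped at 60s.
ceilings = [backoff_ceiling(n) for n in range(8)]
```

A client that retries immediately and forever looks, from the supernode's side, much like attack traffic. A client that backs off this way gives an overloaded supernode room to recover.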
Based on the architecture described above, it would be hard for outsiders to detect the gradual decline. As supernodes updated and got whatever bad update, the un-updated supernodes would become more heavily burdened. Somewhere in there, a tipping point would be reached where enough of the supernodes would have been updated – and therefore unavailable – to where the existing supernodes would be overloaded and simply unavailable or even knocked offline.
If they were knocked offline, all of those clients would attempt to connect to another supernode.. lather, rinse, repeat.
You are probably correct in saying that multiple connect failures were the culprit. My Windows firewall had automatically written “Skype allow” into its permissions section five hundred times!
I have been a regular Skype user in Japan for the past five or six years. It has been great for the most part, but like everybody else I have had my share of disconnects, audio breakup and video jitters. However, the low cost more than makes up for the occasional problem. Skype has saved me probably thousands of dollars and given me things I could never afford in the dark days of international POTS telephony; a way to call New York on a monthly basis for as long as I want and a way to access US toll-free numbers that are actually “toll-free” (my Japanese providers tried to charge me their “cut” in the past but Skype has eliminated that practice for me). Skype also allowed me to eliminate my hugely expensive land-line two years ago when NTT chose not to compete with them and the other VoIP providers, an action that reminded me of ATT when it held a virtual monopoly in the American market. Thank you Skype, and when your IPO comes out, I will be in line for my piece of the action.
As for the recent outage, I was not seriously affected. You’ve all discussed eloquently (and I might add, quite politely – how nice and rare it was to read a civil discussion online for a change) the possible technical reasons, however it seems reasonable to think a poorly timed software upgrade and the increased volume due to holiday traffic conspired to cause the outage. A foreign or domestic hacker attack could have been the trigger as well, Skype being the big target that it is for such things. If I were Mr. Bates I would be choosing my words carefully too, until the facts became totally clear. Hopefully we’ll know more in the coming days.
I was, apparently, one of these “Supernodes”. What happened was this:
Step 1: My Skype client crashed, out of the blue, for no apparent reason
Step 2: Tons and tons of attempts to connect to me as a supernode occurred, nearly DDoS’ing my connection.
/Z
Do you think Skype going down on the 23rd had anything to do with the fact that a patent infringement lawsuit was filed by Gradient Technologies, also on the 23rd? It is very interesting, and coincidental, that the outage happened the same day legal action was taken.
Dan, IMO Skype would be smart to avoid or limit using Windows systems as supernodes. Given the incidence of malware penetration on the typical desktop, particularly here in South Asia (Singapore, reportedly 95%+ of all Net-connected desktops and India/Pakistan limited only by Net connectivity), running *any* sort of P2P software on nodes like that is just down on your knees begging and pleading for trouble. It looks like Murphy obliged.
So Skype uses only totally unprotected MS Windows OS loaded (those windows computers attached to the world wide web not behind a firewall), computers & servers loaded w/ their Skype software as free SuperNodes to forward & route all Skype user traffic using up the bandwidth of uncompensated owners & the only true fully unprotected idiots left in the world? (In explanation….IMHO, if you are out on the net and that box, tower, blackberry, iphone or lapwarmer, etc. isn’t firewalled…..You Are An Idiot!).
Every Windows loaded puter I saw this week for repair or service (more than 30 btw), was doing a final set of updates similar to a Patch Tuesday event. (An end of year prod by MS perhaps to get the OS’es up to par & end their year w/ an unintended Skype bang?)
Of course not every one of those sets of updates required a restart, but most did. Of course EVERY ONE of those units were firewalled and all but one had Skype set to NOT start w/ Windows but from the desktop as and when needed or requested. (And that one will end soon as China is now saying in an article on Yahoo News they will outlaw Skype, and her husband is working in China while she’s home in the US).
So…The true backbone of this new great future IPO offering in VoIP will be the last few thousand totally unprotected (sic) windblows machines that happen to have Skype software installed and Skype doesn’t expect these machines to be DDoS & virus attacked (among others) often?
Good Luck With That!….Just Sayin’.
Maybe the last Win 3.1 or Win98SE user (not behind a firewall at least), had to restart (reboot) his box.
Using the free unprotected windblows spam botnet of the world as a backbone and Skype is actually going to offer an IPO?
Install some hardware yourself Skype, then I might buy some of your stock, but these outages will be often & long I predict using that unprotected user backbone.
What do you think of this article on Skype future? http://www.reviewmaze.com/2010/12/skype-outage.html
Some interesting points being made in the comments, I would certainly like to know a bit more about what happened.