What is an overlay network? What's a DHT? How does a node compare to a supernode? What differentiates a "pure" peer-to-peer (P2P) system from a "hybrid" system?
I have a series of posts planned over the next few weeks related primarily to Skype and some of the changes brought about in Skype 5.0 for Windows that are interesting from a technology point-of-view - but in order to write those posts, I need to build a bit of a foundation of some of the issues and terminology used in peer-to-peer (p2p) networks.
If you are not familiar with peer-to-peer (p2p) networks and the terminology associated with them, my goal is to give you a basic background. If you are familiar with P2P technology, well, you can probably skip this and wait for the next post :-)
Peer-to-peer vs. Client/Server Networks
For starters, let's differentiate between "peer-to-peer networks" (p2p) and more traditional networks. If you think about running, say, AOL Instant Messenger (AIM), on your computer, you install the AIM client software on your computer. When you launch the software, it connects you to the AIM servers running somewhere out on the Internet. When you send a message to another AIM user, that message goes from your client up to the AIM servers and then from there down to the recipient's client.
This is "client/server" networking and is how the majority of the Internet-based services we use are structured. We operate web servers, mail servers, file servers, etc., etc. and have local clients installed on our computers that connect to those servers. Even in many of the "cloud" services out there, we are still using a client (typically a web browser) to access services running on some big set of servers out on the Internet. (And the services may in fact be client/server-based, but with all of that being hidden in internal infrastructure.)
Peer-to-peer networks are completely different in that:
There are NO servers.
(Well... at least in "pure" P2P networks. We'll talk about hybrids later.)
Instead, every "node" on the network is a "peer" - it is both a client and a server. It acts as a client accessing resources from other peers - and acts as a server providing resources to other peers. The computers that are running the p2p software are all participating in a "p2p network" connecting the computers (nodes) together.
While I've heard people lately talking about that p2p network as a "p2p cloud", the technical term is to refer to it as an "overlay network" (or "network overlay" or just "overlay"). The idea is pretty straightforward... you are running software on your computer that "joins" a network that exists on top of the actual Internet connections. Here's a picture of how it might look:
The "Internet" is in fact a collection of top level Internet Service Providers ("ISPs") who run the "backbone" of the Internet and connect out to 2nd tier ISPs who connect to 3rd tier ISPs who connect to... etc., etc. until you actually get down to the network drop or wireless service you use to get access for you PC (using "PC" generically here for a computer... could be Windows, Mac, Linux, whatever). Everything is all interconnected via IP addresses and running the TCP/IP protocol suite at the networking layer.
The "overlay network" is created on top of that IP network by PCs running the P2P software. They connect to each other and communicate with each other regardless of what the underlying network infrastructure actually looks like.
Now, I've shown it in the drawing as a nice somewhat circular ring because it was convenient to do so. In reality the picture would undoubtedly be a whole lot messier - but logically it would still function a lot like a ring in most P2P networks. (And note that some people do talk about an overlay networking being a "p2p ring".)
Distributed Hash Tables (DHTs)
A question, of course, is ... how do those peers know where each other are? If the overlay network is on top of the IP network... still, ultimately the data connections have to happen over IP - how does a peer know where to send packets?
Now this topic gets into a whole area of research that network geeks like me may find absolutely fascinating... but everyone else may find their eyes glazing over.
Suffice it to say that there are a whole number of different algorithms for building p2p networks, and most today use some form of a distributed hash table (DHT). The idea is that you have some piece of information, like my user name, for which you want to look up other information such as how to reach me. Your P2P software performs a "hash" on that info (called the "key") and winds up at an address for that information inside the overlay network. It then queries the overlay network and the node responsible for that "address" responds back with the requested information (the "value").
Think of a DHT as a giant distributed database of "key / value pairs". (And in fact some large distributed database systems use DHTs.)
There is a LOT more I could get into... but that would quickly take this out of "A Basic Primer". Those wanting more info should check out the DHT entries on Wikipedia and the P2P Wiki, both of which are good. I mention DHTs here primarily because any time you go looking for P2P info on the net, you pretty soon wind up reading about DHTs.
If every peer can act as a server, how do other peers know what services a given peer can offer? In our real-time communication world, one peer might have a PSTN gateway connected to it. Or maybe an automated attendant... or a voicemail server. Or maybe it's running on a mobile client with reduced capabilities.
One challenge in a P2P network is to learn what resources each node has. This is typically referred to as "resource discovery" and again, different P2P networks have different methods of solving this issue.
Related to resource discovery is the whole "directory" issue - how do you find other users? If you think of Skype - or any P2P communication system - if I want to call or IM someone, how do I know where they are in the P2P network?
Enrollment / Authorization / Authentication
There are two security-related issues that come up (well, there are many but two key ones). One is - how does a new node join the p2p network? And the second is - how does an existing node that dropped out of the p2p network (perhaps going offline) get authorized to re-join the network?
Some P2P systems refer to this first issue as one of "enrollment." How does a node "enroll" in the p2p overlay network?
The second question is one of authorization / authentication of the node that it is allowed to join the p2p network.
Again, there are multiple ways to solve this issue including, at least in Skype's case, of adding in some servers to the mix to perform these functions.
A fundamental challenge of any P2P network is how to deal with the "churn" of nodes dropping in and out of the overlay network. For instance, as computers go offline and then come back online, they leave the overlay network and then rejoin. Different overlay network topologies have different strategies for coping with churn and minimizing its impact.
Nodes vs "Supernodes"
A second fundamental challenge for P2P networks is our dear old friend "Network Address Translation" (NAT) and all the related firewall fun. Given that a node behind a firewall cannot directly communicate with a node behind another firewall, how do they get connected?
The idea is that you have another network node that is not behind a firewall that can act as a relay between the two nodes behind firewalls. In Skype's terminology, these relay nodes are referred to as "supernodes".
"Pure" vs Hybrid P2P Networks
In an ideal world, a "pure" P2P network has no servers whatsoever. Every node on the network is a peer and functions as both a client and a server.
Sometimes, though, reality intrudes. Sometimes servers may need to be introduced into the mix in order to provide specific services for performance, security or other reasons. For instance, in Skype's case the general understanding is that servers are used for enrollment (to have a Skype client join the P2P overlay network for the first time), for some degree of authentication and also for connections out to the PSTN.
Networks that introduce some level of servers are typically called "hybrid" p2p networks.
That should be enough...
... for me to get started with my articles. I hope this was helpful - and if you have questions about items here or additional points you think I should mention, please feel free to leave them as comments.
Lots of good info out there about P2P networks in general... here are some you may find useful:
- P2P Wiki (and its article on DHTs)
- P2PSIP.org - lots of good info here
- P2PSIP Working Group
- Concepts and Terminology for P2PSIP - good draft to understand IETF's terminology
- P2PSIP Base Protocol Internet-Draft
- For those looking for a background in P2P security issues, I assisted with a now-expired Internet Draft on P2PSIP security issues which you may find interesting.
- Skype: P2P Telephony Explained
- Skype: IT Administrator's Guide to Skype (PDF)
If you have other suggestions, please feel free to send them along.
Flickr credit: porternovelli - and yes, this image really has nothing to do with P2P networks - it's a graph of top PR Twitterers - but it could be for a P2P network... and in a way all of those Twitterers are an overlay network ;-)
If you found this post interesting or useful, please consider either: