Voice over IP (VoIP) is all the rage these days, largely due to the proliferation of Internet Telephony Service Providers (ITSPs) such as Vonage, and the rising popularity of so-called soft PBXes such as Cisco Call Manager and the open source Asterisk. On the surface it seems easy to treat VoIP as another application on the network and just encapsulate voice signals into IP packets. However, coercing voice into data packets requires knowledge of how traditional voice services are provided and what overhead tasks are required to direct the call from one party to another.
Familiarity with these terms and concepts will also help consumers as they start investigating VoIP services and products. Not all ITSPs are the same, and understanding how your current service is provided will allow you to properly compare service offerings and avoid surprises down the road. Similarly, those looking to upgrade their current PBX can start to separate the marketing talk from the things that matter, simply by understanding how the phone system works.
The Public Switched Telephone Network (PSTN) is what most of us are familiar with. Subscribers are given a pair of copper wires (called a loop) connected from the demarcation point at their site to a phone switch at the Central Office (CO). On the user side, telephones are connected in parallel to the demarcation point. At the CO, this loop terminates in a card on a Class 5 phone switch. The Class 5 switches generally aggregate into one or more Class 4 switches which are responsible for marshaling traffic between all the Class 5 switches and the rest of the PSTN.
So far, the data guys in the crowd are thinking this looks quite similar to a data network, and they'd be right. The differences occur when people need to talk. In a data network, a subscriber wanting to talk to another subscriber would drop a packet on the wire with a destination address of the other party. The packet would flow, hop by hop, over the network until received at the other end. We call this packet switched because each packet is switched from the ingress interface of a device to an egress interface closer to the destination. Voice networks, however, are circuit switched, which means that a bidirectional connection has to be nailed up from the source to the destination before people can talk.
As I said before, the interface from you to the phone company is a copper loop. The port on the carrier side is termed Foreign Exchange Subscriber (FXS), and it provides dial tone and voltage to the remote device (your phone). Normally your phone is said to be on hook, which means the circuit between you and the FXS port is open from an electrical standpoint. When you pick up the phone and go off hook, the phone completes the circuit and allows current to flow. The CO provides you with the familiar dial tone and waits for you to enter digits. Each digit is sent as a combination of two frequencies, which is called Dual-Tone Multi-Frequency (DTMF) signaling. It is further defined as in band, since the signals go over the same loop you use for talking. Contrast this with out-of-band signaling, where the control messages travel over a separate connection (we'll see examples of this later).
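To make the two-frequency idea concrete, here is a minimal Python sketch that maps a keypad character to its tone pair and synthesizes the tone. The row and column frequencies are the standard DTMF grid; the function names, the 100 ms duration, and the 8,000 Hz sample rate are assumptions chosen for illustration.

```python
import math

# Standard DTMF keypad grid: every row and every column has its own frequency (Hz).
ROW_FREQS = (697, 770, 852, 941)
COL_FREQS = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key):
    """Return the (row, column) frequency pair for one keypad character."""
    if len(key) != 1:
        raise ValueError("expected a single keypad character")
    for row_index, row in enumerate(KEYPAD):
        col_index = row.find(key)
        if col_index != -1:
            return ROW_FREQS[row_index], COL_FREQS[col_index]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key, duration_s=0.1, sample_rate=8000):
    """Synthesize the tone for one key as floats in the range [-1.0, 1.0]."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    # The tone is simply the sum of the two sine waves, each at half amplitude.
    return [
        0.5 * math.sin(2 * math.pi * low * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * n / sample_rate)
        for n in range(count)
    ]
```

Calling dtmf_frequencies("5") returns the pair (770, 1336). A real switch does the reverse, detecting which two frequencies are present on the loop.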
When your local CO has collected enough digits from you to determine who you're calling, it must build a connection to the other party. To do so, it sends a series of messages across the Signaling System 7 (SS7) network to build a dedicated voice circuit over the PSTN to the destination network. The remote CO rings the other party's phone and, when they go off hook, sends the voice traffic along. At this point both people can talk.
While the connection between the telephone and CO is an analog copper loop, the rest of the network is digital. Analog voice is sampled into digital signals for transit across the PSTN. Voice frequencies are generally under 4 kHz (thousands of cycles per second), and the Nyquist theorem says we have to sample at no less than twice this rate (8 kHz) to get a good representation of the original signal. Choosing eight bits per sample, this means a voice stream consumes 64,000 bit/s, which is the fundamental building block of digital telephony. One such channel is called a DS0. The method used to convert the analog signal to a digital sample is called Pulse Code Modulation (PCM).
In order to maintain our sampling rate we have 125 microseconds (µs) between consecutive samples (1/8,000 Hz), which is more than enough time to transmit the eight bits of information. Multiple samples can then be sent consecutively, each belonging to a different stream. This is called Time Division Multiplexing (TDM). Twenty-four such samples can be smashed together to give the next building block of digital telephony, called the DS1, otherwise known as a T1. Each DS1 frame consists of a single framing bit followed by 24 DS0 samples, for a total of 193 bits per frame. At 8,000 frames per second, we get the familiar 1.544 Mbit/s for a T1. In Europe they use 32 DS0s and call it an E1.
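The arithmetic and the framing can be checked with a short Python sketch. It derives the DS0 and DS1 rates from the figures above and interleaves one sample per channel into a 193-bit frame. The function names and the list-of-bits representation are assumptions made for readability, not how real framing hardware works.

```python
SAMPLE_RATE_HZ = 8000        # one sample per channel, so one frame, every 125 µs
BITS_PER_SAMPLE = 8
CHANNELS_PER_DS1 = 24
FRAMING_BITS = 1

DS0_BIT_RATE = SAMPLE_RATE_HZ * BITS_PER_SAMPLE                      # 64,000 bit/s
DS1_FRAME_BITS = FRAMING_BITS + CHANNELS_PER_DS1 * BITS_PER_SAMPLE   # 193 bits
DS1_BIT_RATE = DS1_FRAME_BITS * SAMPLE_RATE_HZ                       # 1,544,000 bit/s
FRAME_PERIOD_US = 1_000_000 / SAMPLE_RATE_HZ                         # 125.0 µs


def build_ds1_frame(samples, framing_bit=0):
    """Interleave one 8-bit sample per channel into a single frame (list of bits)."""
    if len(samples) != CHANNELS_PER_DS1:
        raise ValueError("a DS1 frame carries exactly 24 samples")
    bits = [framing_bit & 1]
    for sample in samples:
        if not 0 <= sample <= 0xFF:
            raise ValueError("samples must fit in eight bits")
        # Emit each sample most significant bit first.
        bits.extend((sample >> shift) & 1 for shift in range(7, -1, -1))
    return bits


def timeslot(frame_bits, channel):
    """Recover the sample carried in time slot `channel` (0 to 23)."""
    start = FRAMING_BITS + channel * BITS_PER_SAMPLE
    value = 0
    for bit in frame_bits[start:start + BITS_PER_SAMPLE]:
        value = (value << 1) | bit
    return value
```

Because a channel always occupies the same offset in every frame, timeslot() needs nothing but the channel number to find its sample. That is the defining property of TDM.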
From here we can cram more channels onto the wire as long as we respect the 125 µs requirement. The important thing is that each conversation always appears at the same spot in each frame (also called a time slot). If there happens to be no conversation in a particular time slot, data must still be transferred! Likewise, if nobody is talking into their phone, the time slot is still used.
With all 192 data bits of the frame in use, there's no spot for signaling between the two sides of the DS1. Robbed Bit Signaling (RBS) steals a bit from some of the DS0s, effectively reducing the sample size to seven bits, or 56 kbit/s. This is another example of in-band signaling. Alternatively, Integrated Services Digital Network (ISDN) protocols can be used to dedicate one of the voice channels to signaling and the other 23 to voice in a form known as Primary Rate Interface (PRI). The out-of-band signaling channel is called the D (data) channel, and the voice channels are B (bearer) channels.
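As a rough illustration of the two approaches, the Python sketch below overwrites the least significant bit of a sample with a signaling bit, then works out the resulting rates and the PRI channel split. Treating every sample as robbed is a simplifying assumption that matches the seven-bit figure above, and all names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
DS1_CHANNELS = 24


def rob_bit(sample, signaling_bit):
    """Replace the least significant bit of an 8-bit sample with a signaling bit."""
    return (sample & 0xFE) | (signaling_bit & 1)


def read_robbed_bit(sample):
    """Read the signaling bit back out of a robbed sample."""
    return sample & 1


# In band: only seven bits per sample can be relied on to carry the payload.
RBS_USABLE_BIT_RATE = (BITS_PER_SAMPLE - 1) * SAMPLE_RATE_HZ   # 56,000 bit/s

# Out of band: a PRI gives up one whole channel and leaves the rest untouched.
PRI_D_CHANNELS = 1
PRI_B_CHANNELS = DS1_CHANNELS - PRI_D_CHANNELS                 # 23
PRI_B_CHANNEL_BIT_RATE = BITS_PER_SAMPLE * SAMPLE_RATE_HZ      # 64,000 bit/s
```

The trade-off is visible in the constants: RBS keeps all 24 channels but degrades each one slightly, while PRI keeps 23 clean channels and spends the 24th on signaling.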
Packaging all the DS0s into a single physical connection has many advantages. The destination number can be passed over the signaling channel which means that a phone number isn't tied to a particular time slot, so many phone numbers can be used on a single access (these phone numbers are usually called Direct Inward Dials (DID)). Also, the number of DS0s you need is determined by how many simultaneous phone calls you want to have, not how many phones you have. Here lies another key difference between traditional and IP telephony: unused trunks are wasted in the traditional model, but they can be used for other things in the IP world.
While having a T1-type service to your large business might be feasible, 23 channels are a bit much for many smaller cases. There are two common solutions here. One is ISDN Basic Rate Interface (BRI), which provides two B channels and a D channel, along with the signaling available in the PRI service. The other option is to have a series of analog lines provided off an FXS port in the CO. Rather than plugging a phone into the line, we provide a Foreign Exchange Office (FXO) port. A modem is an example of FXO; it can put the phone on and off hook, dial, and speak. Regular analog service, however, doesn't provide the signaling available in digital service, so giving individual DIDs to your employees isn't possible. Instead, the PBX answering the call can provide a menu, called an Auto Attendant, that directs the caller to the appropriate internal extension. There is, however, an analog trunk service available which, when combined with different hardware, can provide the information necessary to have DID service.
When moving to the VoIP world things change significantly. The underlying transport, IP, is connectionless. This means that each packet is passed from hop to hop without the path being nailed up. The downside is that it's harder for us to allocate the network resources in advance, but it lets us make better use of the resources we have. The voice traffic itself is carried over the Real-time Transport Protocol (RTP), which itself runs over UDP. RTP adds sequencing information to the stream so that out-of-order packets can be detected. Note that RTP is still unreliable in that it doesn't retransmit missing or malformed packets, since a voice packet that arrives late is of no use to the listener.
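The sequencing information lives in a fixed 12-byte header at the front of every RTP packet. The Python sketch below packs that header and then scans a list of received sequence numbers for jumps. The header layout follows the RTP specification; the function names and the deliberately naive gap detector are assumptions for illustration and are not taken from any particular VoIP stack.

```python
import struct

# Fixed RTP header: flags byte, marker/payload-type byte, 16-bit sequence number,
# 32-bit timestamp, 32-bit source identifier, all in network byte order.
RTP_HEADER = struct.Struct("!BBHII")
RTP_VERSION = 2


def pack_rtp_header(payload_type, sequence, timestamp, ssrc, marker=False):
    """Build a 12-byte header with no padding, extension, or contributing sources."""
    first_byte = RTP_VERSION << 6
    second_byte = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(
        first_byte,
        second_byte,
        sequence & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )


def unpack_sequence(packet):
    """Extract the sequence number from the start of a packet."""
    _, _, sequence, _, _ = RTP_HEADER.unpack_from(packet)
    return sequence


def find_gaps(sequences):
    """Return (expected, received) pairs wherever the sequence number jumps."""
    gaps = []
    expected = None
    for sequence in sequences:
        if expected is not None and sequence != expected:
            gaps.append((expected, sequence))
        # Sequence numbers are 16 bits wide and wrap around to zero.
        expected = (sequence + 1) & 0xFFFF
    return gaps
```

Given the arrivals 1, 2, 4, 5, find_gaps() reports that 3 was expected when 4 showed up. The receiver does not ask for packet 3 again; it conceals the loss as best it can and moves on.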
Voice can be encoded in different ways, which allows for compression. G.711 transmits the raw 64 kbit/s PCM data in packets of 20 ms. G.729 compresses the data down to 8 kbit/s with little loss in voice quality. There are many different codecs available, and both ends must agree on the same one to talk. Transcoding is the process whereby the voice stream is changed from one codec to another, which is used when the two endpoints can't use the same codec. Since voice compression is inherently lossy, multiple transcodings severely degrade the voice quality and this practice should be avoided.
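Codec bit rates only tell part of the story, because every packet also carries IP, UDP, and RTP headers. The Python sketch below estimates the bandwidth of one direction of a call at the IP layer. It assumes 20 ms packets for both codecs, IPv4 with no options, and a plain 12-byte RTP header, and it ignores link-layer overhead; the function names are hypothetical.

```python
# Per-packet header overhead: IPv4 (20) + UDP (8) + RTP (12) bytes.
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12


def payload_bytes(codec_bit_rate, packet_ms=20):
    """Bytes of encoded voice carried in one packet."""
    # bits per second * milliseconds / 1000 gives bits; divide by 8 for bytes.
    return codec_bit_rate * packet_ms // 8000


def ip_bandwidth(codec_bit_rate, packet_ms=20):
    """Bits per second on the wire at the IP layer, for one direction of a call."""
    # Assumes packet_ms divides evenly into one second.
    packets_per_second = 1000 // packet_ms
    packet_bytes = payload_bytes(codec_bit_rate, packet_ms) + IP_UDP_RTP_HEADER_BYTES
    return packet_bytes * 8 * packets_per_second


G711_BANDWIDTH = ip_bandwidth(64_000)   # 160-byte payload per packet
G729_BANDWIDTH = ip_bandwidth(8_000)    # 20-byte payload per packet
```

Under these assumptions G.711 needs 80,000 bit/s and G.729 needs 24,000 bit/s, so the eightfold compression in the codec shrinks to a little over threefold once headers are counted.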
Though the voice bearer traffic is easy enough to understand, it is the signaling that sets up and tears down calls that is complex. There are several protocols that can be used to do this, the most popular being H.323 and SIP.
H.323 is an older protocol (actually, a series of protocols) that was created by the International Telecommunication Union (ITU) for voice and video communications. SIP, the Session Initiation Protocol, is a creation of the IETF, who also bring you such novel concepts as the internet itself. SIP was designed to carry a variety of information, from call setup to presence and instant messaging. For either choice the purpose and operation are similar.
Consider the case of an IP telephone user who wants to call another user over the internet. Generally the phone itself has a network connection to a server that provides the call control, and for the sake of simplicity let's assume both phones use the same server. One phone dials a number (or a SIP address such as user@example.com) which is sent to the server.
The server checks to see if the call is allowed, and then invites the dialed party's phone to join the call. At the same time, a message is passed back to the caller's phone that the call is in progress, which results in the familiar ringback tone being played to the caller. When the dialed party's phone goes off hook, it sends a message to the server indicating it has been answered, which is again relayed back to the caller. This message is then acknowledged back to the server and then in turn to the phone.
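The exchange just described can be written down as an ordered list of messages. The Python sketch below labels each step with the SIP request or response that would typically carry it; that mapping is an assumption, since the description above is protocol-neutral, and the function and party names are hypothetical. It models only the order of messages and does not implement any signaling.

```python
def call_setup(caller, callee, server="server", allowed=True):
    """Return the ordered (sender, receiver, message) steps for one call attempt."""
    steps = [(caller, server, "INVITE")]
    if not allowed:
        # The server refuses the call before the dialed party is ever contacted.
        steps.append((server, caller, "403 Forbidden"))
        return steps
    steps += [
        (server, callee, "INVITE"),        # invite the dialed party's phone
        (callee, server, "180 Ringing"),   # the dialed phone is alerting
        (server, caller, "180 Ringing"),   # caller hears ringback
        (callee, server, "200 OK"),        # dialed party went off hook
        (server, caller, "200 OK"),        # answer relayed to the caller
        (caller, server, "ACK"),           # caller acknowledges the answer
        (server, callee, "ACK"),           # acknowledgment relayed onward
    ]
    return steps


if __name__ == "__main__":
    for sender, receiver, message in call_setup("alice", "bob"):
        print(f"{sender:>6} -> {receiver:<6} {message}")
```

Every message in the list passes through the server, which is what lets it enforce policy.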
At this point both phones are ready to talk using RTP. However, the flow of the packets is between the two phones--the server doesn't get involved at all. Unless transcoding is needed, or there are reasons to have the server interject, having the two phones call each other is the best use of resources. Removing the server from the flow also allows for better load balancing and reliability.
This model can be extended to multiple servers or even to the PSTN. In the latter case, one of the phones may actually be a gateway to a PRI. When the call gets to the gateway, it dials out over the PSTN to the remote telephone. There's nothing saying we can't do a lookup beforehand to see if the called party is better reached over the internet or the PSTN. The IP phone itself can also be an Analog Terminal Adapter (ATA), which connects a standard analog set to the IP network. If you've ever used a service like Vonage, this is what you're using. There's no limit to how we treat calls now that the intelligence has been moved closer to the user's hands.
The traditional PSTN is a complex creature, but it must be understood in order to integrate successfully with VoIP. Even though VoIP protocols borrow from traditional telephony concepts, the shift of intelligence from the core to the edge opens up a whole suite of new applications.
Sean Walberg is currently a Senior Network Engineer with a large Canadian financial services company, where he is building up the voice and data networks.
Copyright © 2009 O'Reilly Media, Inc.