Voice over IP guide

About the Author: Lammert Bies is a dad, husband and polyglot. He has been developing embedded systems since the eighties. Used machine learning before it had a name. Specializes in interconnecting computers, robots and humans. Was a Google Mapmaker Advocate and speaker at several international Google conferences from 2011 until the plug was pulled on Mapmaker in 2017. Bughunter with Google. Currently spreading artificial intelligence to the wildest locations in production environments. He never stops learning.

Voice over IP, an introduction
Digitizing and compressing speech
The CODEC, the VoIP compression workhorse

Voice over IP, an introduction

One of the newer and fastest-spreading communication technologies in the world is without doubt Voice over IP, or VoIP. With this technology, voice is digitized and split into packets, the packets are routed over an IP network, which can be either a local network or the internet, and at the receiver side they are reassembled and converted back to analog sound. The main use is to replace telephony, and because of this the real-time performance of VoIP technologies is a major issue.

Several different implementations of VoIP have appeared in recent years. Some of them are fully proprietary, while others are based on standards. Depending on the technology, some allow integration with the common telephony infrastructure, which potentially gives you the best of both worlds. There are some things to consider, though, before deciding which technology is best for a particular situation.

Digitizing and compressing speech

The first step of every VoIP connection is converting the analog voice signal to digital data. This can be done in a number of ways. The easiest way is to take a fixed sampling rate which is high enough to capture all needed frequencies and to divide the signal strength into a number of levels. 8000 Hz and 256 levels are commonly seen settings. In this way the signal is scanned by a normal analog to digital converter or ADC, which samples the data at the fixed frequency with a depth of 8 bits. The data is sent uncompressed to the other party and decoded by a digital to analog converter or DAC. The combination of 8 kilohertz and a sampling depth of 8 bits is good enough to replace normal telephony conversations. Telephony communication takes place at frequencies between approximately 300 Hz and 3400 Hz, and 256 different levels are enough to provide good quality.
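The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration, not the code of any real ADC or VoIP stack: the function names, the 1 kHz test tone and the 20 ms duration are my own assumptions.

```python
import math

SAMPLE_RATE = 8000   # samples per second, as in the text
LEVELS = 256         # 8 bits per sample

def sample_tone(freq_hz, duration_s, amplitude=1.0):
    """Sample a sine tone at SAMPLE_RATE; values lie in the range -1.0..1.0."""
    count = round(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(count)]

def quantize(value, levels=LEVELS):
    """Map a value in -1.0..1.0 to an integer code 0..levels-1 (linear steps)."""
    code = round((value + 1.0) / 2.0 * (levels - 1))
    return max(0, min(levels - 1, code))

def dequantize(code, levels=LEVELS):
    """Map an integer code back to a value in -1.0..1.0."""
    return code / (levels - 1) * 2.0 - 1.0

samples = sample_tone(1000, 0.02)             # hypothetical test signal: 20 ms of 1 kHz
codes = bytes(quantize(s) for s in samples)   # one byte per sample, sent uncompressed
decoded = [dequantize(c) for c in codes]      # what the DAC side would reconstruct
worst_error = max(abs(a - b) for a, b in zip(samples, decoded))
print(len(codes), "bytes for 20 ms, worst error", round(worst_error, 4))
```

The worst error stays within half a quantization step, which for 256 levels is well below one percent of the full signal range.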

But good quality comes at a price. Uncompressed sampling of data at this rate generates a continuous stream of data of 8 kbytes/sec (64 kbit/s). This is not a big deal for broadband connections, but it can be too much for remote locations with slower internet connections or, even worse, for a mobile internet connection. Therefore several attempts have been made to reduce the number of kilobytes needed per second to achieve acceptable voice quality. This can in principle be achieved in several ways. You can reduce the sampling frequency somewhat, but this has a negative effect because higher frequencies get filtered out. According to the Nyquist theorem, which dates back to 1928, long before there was any VoIP or even internet, a signal can only be reconstructed faithfully if the sample rate is at least 2 times the highest frequency in its spectrum. Lowering the sampling rate to 4000 Hz, for example, would reduce the maximum allowed frequency in the analog signal to 2000 Hz, which is way below frequencies that are common in speech, especially from women and children.
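The arithmetic behind these numbers is simple enough to check yourself. The sketch below uses two helper functions with names I made up for this example.

```python
def data_rate_bytes_per_s(sample_rate_hz, bits_per_sample):
    """Continuous data rate of an uncompressed sample stream."""
    return sample_rate_hz * bits_per_sample / 8

def nyquist_limit_hz(sample_rate_hz):
    """Highest signal frequency that can still be represented."""
    return sample_rate_hz / 2

for rate in (8000, 4000):
    print(rate, "Hz sampling at 8 bits:",
          data_rate_bytes_per_s(rate, 8), "bytes/s,",
          "highest usable frequency", nyquist_limit_hz(rate), "Hz")
```

Halving the sample rate halves the data rate, but it also halves the highest frequency that survives the conversion.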

So although reducing the sampling rate may help somewhat in reducing the bandwidth allocation of the VoIP application, the gain is limited. Another approach is therefore to reduce the number of bits necessary to store one data sample. As mentioned earlier, 8 bits will give a reasonably high quality encoding of a speech signal. Reducing the number to 4 would reduce the bandwidth by 50%. Unfortunately this reduction also comes at a price. With 8 bits, there are 256 possible signal levels. Decoding such a signal back to analog gives a smooth signal where the step from one level to the next is less than 0.5% of the peak-to-peak signal value. Although a 0.5% distortion may be audible, the speech will still be understandable, and most people who are not specifically trained to detect these distortions won't hear them at all.

With a 4 bit encoding depth of the signal, only 16 different levels are available. That is not much. Every step in the digital to analog conversion will be about 7% of the range between the maximum possible signal peaks, because the levels are divided linearly over that range. Imagine that someone is speaking softly on the telephone, so that the signal strength won't be more than 25% of the peak. In that case only about four of the 16 levels are actually used and the digital encoding is almost binary, giving a Donald Duck-like sound at the receiver's side. Reducing the sampling depth by 50% from 8 to 4 bits makes the quantization steps 16 times coarser. That is also not what we want.
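The step sizes for 8 and 4 bits, and the number of levels that a soft speaker really uses, can be verified with a short calculation. The helper names below are hypothetical, and the soft speaker is modelled simply as a signal that swings to 25% of full scale in both directions.

```python
def step_percent(bits):
    """Size of one quantization step as a percentage of the peak-to-peak range."""
    levels = 2 ** bits
    return 100.0 / (levels - 1)

def levels_used(bits, relative_amplitude):
    """Number of distinct codes hit by a signal that swings between
    -relative_amplitude and +relative_amplitude of full scale."""
    levels = 2 ** bits
    low = round((1.0 - relative_amplitude) / 2.0 * (levels - 1))
    high = round((1.0 + relative_amplitude) / 2.0 * (levels - 1))
    return high - low + 1

for bits in (8, 4):
    print(bits, "bits: step", round(step_percent(bits), 2), "%,",
          levels_used(bits, 0.25), "levels used at 25% volume")
```

At 8 bits the soft speaker still has 64 levels to work with, at 4 bits only 4.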

One solution to the problem of bad voice quality at low signal volumes is to not divide the signal strength graph into 16 equal levels, but to have more levels around the zero line and fewer levels near maximum volume. A common approach is to use a logarithmic scale rather than a linear scale. Logarithmic scales are not strange in this application, because our ears roughly hear volume differences on a logarithmic scale. Ten times more volume in terms of energy is heard as about two times louder by the human ear.
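To get a feeling for the effect of a logarithmic scale, the sketch below quantizes a quiet signal with 4 bits, once linearly and once after a logarithmic compression curve. The curve is the continuous µ-law formula with µ = 255, which I picked as an assumption for this example. The function names and the test signal are made up as well.

```python
import math

MU = 255.0  # assumed steepness of the logarithmic curve

def compress(x, mu=MU):
    """Logarithmic curve: stretches small values, squeezes large ones."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def expand(y, mu=MU):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_linear(x, bits):
    """Round x in -1.0..1.0 to the nearest of 2**bits equally spaced levels."""
    levels = 2 ** bits
    code = round((x + 1.0) / 2.0 * (levels - 1))
    return code / (levels - 1) * 2.0 - 1.0

def quantize_log(x, bits):
    """Quantize on the logarithmic scale, then convert back."""
    return expand(quantize_linear(compress(x), bits))

# Hypothetical quiet signal: one sine period at 5% of full scale
quiet = [0.05 * math.sin(2 * math.pi * n / 40) for n in range(40)]
err_linear = max(abs(s - quantize_linear(s, 4)) for s in quiet)
err_log = max(abs(s - quantize_log(s, 4)) for s in quiet)
print("worst error linear:", round(err_linear, 4), "logarithmic:", round(err_log, 4))
```

With the same 16 codes, the logarithmic version follows the quiet signal much more closely, at the cost of coarser steps near maximum volume.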

Another solution is to use adaptive algorithms that divide the signal linearly, not between the maximum possible peak-to-peak signal values, but between the actual extremes of the signal. If someone is speaking at low volume, these algorithms automatically boost the signal, and the signal quantization error is never much more than 7% at 4 bits sample depth. You can sometimes hear this type of signal conditioning on lines with noise, where the noise level increases in between words or sentences. This is because the sampling algorithm amplifies low volume signals, even if there is no real signal present. The amplification of the noise on the line is heard at the receiving end.
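A very simple form of such an adaptive scheme is sketched below. It scales every block of samples between its own minimum and maximum before quantizing to 4 bits. This is my own simplified illustration, not the algorithm of any particular codec, and it assumes that the block minimum and span are sent along with the codes.

```python
import math

def encode_block(samples, bits=4):
    """Quantize a block linearly between its own minimum and maximum."""
    levels = 2 ** bits
    low, high = min(samples), max(samples)
    span = (high - low) or 1.0   # avoid division by zero for a flat block
    codes = [round((s - low) / span * (levels - 1)) for s in samples]
    return low, span, codes

def decode_block(low, span, codes, bits=4):
    """Rebuild the samples from the block minimum, span and codes."""
    levels = 2 ** bits
    return [low + c / (levels - 1) * span for c in codes]

def worst_error(samples, bits=4):
    """Largest difference between original and decoded samples."""
    low, span, codes = encode_block(samples, bits)
    decoded = decode_block(low, span, codes, bits)
    return max(abs(a - b) for a, b in zip(samples, decoded))

# Hypothetical test blocks: quiet speech and very weak line noise
quiet_speech = [0.05 * math.sin(2 * math.pi * n / 40) for n in range(40)]
line_noise = [0.001 * math.sin(2 * math.pi * n / 7) for n in range(40)]

for name, block in (("quiet speech", quiet_speech), ("line noise", line_noise)):
    low, span, codes = encode_block(block)
    print(name, "span", round(span, 4), "codes used", len(set(codes)),
          "worst error", round(worst_error(block), 5))
```

Note that the weak line noise is spread over the full code range just like the speech block. That is the same effect as the noise that becomes audible between words.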

The best way to reduce the bandwidth needs of a VoIP application is to use a compression method designed specifically for speech that accepts a small loss of information. We all know compression from our computer. Applications like ZIP reduce the size of files by analyzing bit patterns and calculating alternative bit patterns and conversion tables that take up less space than the original file. Compression techniques as used in ZIP are called lossless compression, because it is possible to extract the original files from the compressed version without any loss of information. Other techniques are lossy and accept that some information is lost in exchange for extra compression. Lossy compression methods are often used in picture compression, as with the JPG format. The decompressed version looks like the original, but on close observation you may see artifacts caused by the compression algorithm. This type of algorithm works best when it has been developed with knowledge of the data to compress. Specific algorithms have been developed for voice compression which combine low loss with a very small bandwidth allocation. The compression used in mobile phones is one example.
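The difference between the two families of compression can be shown with Python's standard zlib module, which uses the same Deflate method that is common in ZIP files. The second half of the example is a deliberately crude stand-in for compression with loss: it simply throws away the lower 4 bits of every byte. The test data is made up for the occasion.

```python
import zlib

# Compression without loss: the original comes back bit for bit
text = b"voice over ip " * 200
packed = zlib.compress(text)
restored = zlib.decompress(packed)
print(len(text), "bytes packed to", len(packed), "identical:", restored == text)

# Compression with loss (crude illustration): keep only the upper 4 bits
samples = bytes(range(256))
reduced = bytes(b & 0xF0 for b in samples)
print("distinct values before:", len(set(samples)), "after:", len(set(reduced)))
```

After the second step there is no way to tell which of 16 neighbouring values a byte originally had. Real voice codecs are far more clever about which information they drop.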

The CODEC, the VoIP compression workhorse

With so many different ways in which digitized speech can be encoded to be sent over a digital line, VoIP applications must know which encoding method is supported by the other party in order to make a successful connection. This is achieved by letting the encoding and decoding be performed by a standardized piece of hardware or software, the CODEC, or coder decoder. Codecs are used in many applications including video, but we will now focus solely on codecs that can be used with VoIP.

Name	Compression	Bitrate (kbps)	Application
G.711	A-law and µ-law PCM	64	General telephony
G.726	ADPCM	16, 24, 32, 40	International telephony, DECT
GSM 06.10 FR	RPE-LTP	13.2	Original GSM codec
G.729	CS-ACELP	8	VoIP over low speed connection

Common codecs in VoIP applications
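To translate the bitrates in the table to something more tangible, the sketch below calculates how many bytes of voice data each codec produces per packet. The packet interval of 20 ms is an assumption on my side, and so are the names in the code. Packet headers come on top of these numbers.

```python
PACKET_MS = 20  # assumed packet interval in milliseconds

# Bitrates in kbps, copied from the table
CODECS = {
    "G.711": [64],
    "G.726": [16, 24, 32, 40],
    "GSM 06.10 FR": [13.2],
    "G.729": [8],
}

def payload_bytes(bitrate_kbps, packet_ms=PACKET_MS):
    """Bytes of encoded voice in one packet, headers not included."""
    bits_per_second = round(bitrate_kbps * 1000)
    return bits_per_second * packet_ms / 8000

for name, rates in CODECS.items():
    sizes = ", ".join(str(payload_bytes(r)) for r in rates)
    print(name, "->", sizes, "bytes per", PACKET_MS, "ms")
```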

Of course there are dozens of codecs around, but I will stick to these four as they are the best known and available in most VoIP applications. It is amazing how many abbreviations and new terms fit in such a small table, and I will therefore first give some explanation.

ITU-T standards G.711, G.726 and G.729

Standardization is important to let two VoIP applications communicate with each other. Fortunately the telecommunications sector has always felt the need to standardize protocols and information exchange, and the first official organization for this goes back to 1865: the ITU or International Telegraph Union. This organization became an official United Nations agency in 1947. The standardization agency of the ITU evolved into the CCITT or Comité Consultatif International Téléphonique et Télégraphique in 1956 and was finally renamed ITU-T in 1993. The abbreviation CCITT is still used in many places, for example when talking about specific CRC calculation algorithms.

The ITU-T has defined a number of speech compression algorithms which are used in national and international telephony communications. All these compression standards are named by the letter G followed by a number. As a rule of thumb you could say that the numbering gives the sequence of the standards, and that higher numbers in general define standards with more complex compression techniques, which require more computational effort than the lower numbered standards but have a better speech quality to bandwidth ratio.

A-law and µ-law PCM

The G.711 standard allows two ways of compressing incoming voice data, often referred to as A-law and µ-law. Both use PCM, or pulse-code modulation, as the base data sampling method. With PCM the signal is sampled at a regular interval. G.711 uses a PCM frequency of 8 kHz, which results in 8000 samples per second. Each sample starts with a linear bit depth of 13 bits (A-law) or 14 bits (µ-law), which gives a high initial quality with only small errors due to quantization, and is then compressed logarithmically to 8 bits, which results in the 64 kbps bitrate listed in the table. The choice between A-law and µ-law is mainly geographical: µ-law is used mainly in North America and Japan, whereas A-law is the standard for the rest of the world. There are also slight algorithmic differences which make A-law easier to implement and less demanding on computational resources than its counterpart µ-law.
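For those who want to see what µ-law encoding looks like in practice, below is a sketch of the segment based encoder and decoder as it is commonly found in software implementations. It works on 16 bit signed samples rather than the 14 bit input of the standard, which is an assumption of this example, and the function names are my own. Treat it as an illustration and not as a verified G.711 implementation.

```python
BIAS = 0x84    # 132, added before the segment search
CLIP = 32635   # largest magnitude that still fits after adding BIAS

def linear_to_ulaw(sample):
    """Compress one 16 bit signed sample to an 8 bit µ-law code."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment: position of the highest set bit, counted from bit 7
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # µ-law codes are transmitted with all bits inverted
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def ulaw_to_linear(code):
    """Expand an 8 bit µ-law code back to a 16 bit signed sample."""
    code = ~code & 0xFF
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude

for value in (0, 100, 1000, -1000, 30000):
    code = linear_to_ulaw(value)
    print(value, "->", hex(code), "->", ulaw_to_linear(code))
```

Small values come back almost exactly, while large values are rounded to much coarser steps. That is the logarithmic scale at work.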
