Understanding MOS for voice quality

Looking to improve the quality of voice calls? Learn which mean-opinion-score models measure and rate the user experience and the factors that influence voice quality.

Irwin Lazar, Metrigy Research

Published: 21 Dec 2018

User satisfaction with the dialing experience and perception of voice quality are the foremost factors used to judge the quality of phone calls. Users assign descriptive adjectives -- good, OK, poor or terrible -- to voice quality. The user's judgment of voice quality is subjective. If you know the caller well and understand their speech patterns, even a poor connection can carry an understandable conversation.

Now, imagine that you have the same poor-quality connection with someone who has an unfamiliar accent or speaks rapidly. Understanding the conversation becomes much less likely.

Factors that affect mean opinion score

Good clarity is the primary description of an acceptable voice call. Clarity is the clearness, fidelity, intelligibility and lack of distortion of speech. The following five components define the elements of sound quality for one direction of a call:

The speech volume level cannot be too low (whispering) or too high (shouting).
All speech is distorted, even in the public switched telephone network (PSTN) conversion of analog to digital speech. Is the distortion perceivable to the listener? The greater the distortion, the poorer the comprehension of the conversation will be. You may not even be able to recognize the speaker.
Background noise exists in the form of static and hum in all calls. This is known as noise level. The noise may be at a level low enough for the listener not to notice it at all or at a level so high it impedes clear conversation. For example, a high noise level occurs when one of the participants in a call is at a loud public location.
The signal level -- loudness -- may change, increasing or decreasing during the call.
Audio range. Wideband audio codecs supporting high-definition audio are able to capture a wider range of audio frequencies, resulting in a conversation with higher audio fidelity than is possible using narrowband codecs.

A variety of factors can negatively influence clarity. These include the following:

Echo is the sound of the speaker's voice returning to -- and being heard by -- the speaker. Think of echo as a problem of long round-trip delay. The speaker may not perceive short-delay echoes. The longer the round-trip delay, the more difficult the echo is for the speaker to ignore. The speaker will probably pause so the echo does not interfere with the speech.
Latency is the time it takes for speech to travel from the speaker's mouthpiece to the listener's earpiece. The PSTN, within the U.S., usually has a delay of 30 ms or less. The latency goal is to have a one-way delay of 100 ms or less in voice over IP (VoIP) calls, with an upper limit of 150 ms. Very long latency will cause the speakers to pause, because they are not sure when the other speaker has finished, or they may barge in on each other's conversation. Latency is especially difficult to control when accessing VoIP services via the internet.
Silence suppression and voice activity detection performance. Silence suppression is used in VoIP to reduce bandwidth consumption. When these technologies are used, the beginnings and ends of words tend to be clipped off, especially the "T" and "S" sounds at the end of a word.
Echo canceller performance. The longer the latency, the more the echo needs to be eliminated. Echoes may occur in only one direction or in both directions. The echo cancellers may not work, or they may not be able to effectively compensate when there is significant jitter during the VoIP connection.
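
The latency targets in the list above (a goal of 100 ms or less one way, with an upper limit of 150 ms) can be expressed as a simple check. The following is a minimal sketch only; the function name and the band labels are hypothetical and are not taken from any standard or product.

```python
def classify_one_way_delay(delay_ms: float) -> str:
    """Classify a one-way delay against the VoIP targets quoted above.

    The labels are illustrative assumptions, not standard terminology.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    if delay_ms <= 100:
        return "within goal"          # 100 ms or less: the stated goal
    if delay_ms <= 150:
        return "within upper limit"   # above the goal, at or under 150 ms
    return "over limit"               # beyond the stated upper limit


if __name__ == "__main__":
    for sample_ms in (30, 120, 200):
        print(sample_ms, "ms:", classify_one_way_delay(sample_ms))
```

A typical U.S. PSTN delay of 30 ms falls in the first band, while a 200 ms path would exceed the limit.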

Technologies that can improve voice clarity include the following:

Directional microphones. Meeting room microphones in ceilings, on tables or in speakerphones may pick up background noise, resulting in poor call quality. Many microphone systems and phones use intelligent directional microphones that sense the location of an active speaker and disable input from microphones away from the voice source to eliminate extraneous noise during a call.
Noise cancellation. Active noise cancellation uses algorithms in phones, headsets and software clients to cancel out background noise in the receiver, resulting in a higher-quality call experience. Some headsets also offer noise cancellation in the transmitter, limiting the amount of background noise transmitted by the microphone. Passive noise cancellation provided by over-the-ear or in-ear headsets may further block out background noise.

Evaluating the mean opinion score

Mean opinion score (MOS) is a standard numeric value, defined by the International Telecommunication Union (ITU) in recommendation P.10, used to measure and report on voice quality. MOS ranges from a maximum score of 5, which is considered to be the same as speaking directly into the person's ear, to a minimum score of 1, which is voice quality that is unacceptable to all users. MOS does not include what has been defined as the call experience, only the sound or voice quality.

An MOS of 4.4 to 4.5 is considered equivalent to a toll-quality call as experienced on the PSTN. Users who experience an MOS of 4.5 will be very satisfied. An MOS of 4.0 is still considered acceptable to the vast majority of users. When the MOS decreases to 3.5, some users may find the voice quality unacceptable. Most non-HD voice cellular calls have an MOS rating of 3.8 to 4.0, where speaker and word recognition may be impaired.

When the MOS falls below 3.5, users will be dissatisfied and will either retry the call or potentially contact the help desk and open a trouble ticket. An MOS below 2.6 is considered to be an awful call. The user with an MOS of 2.6 will need to find an alternative network for this call -- for example, when a wireless call is terminated and the speaker moves to the PSTN.
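
The score ranges described above can be summarized as a lookup. This is a rough sketch that uses only the thresholds quoted in this article; the function name, the band labels and the handling of scores that sit exactly on a boundary are assumptions made for illustration.

```python
def describe_mos(mos: float) -> str:
    """Map a MOS value to the user-reaction bands described in the article.

    Boundary handling (>= at each threshold) is an assumption.
    """
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must be between 1 and 5")
    if mos >= 4.4:
        return "toll quality, users very satisfied"
    if mos >= 4.0:
        return "acceptable to the vast majority of users"
    if mos >= 3.5:
        return "some users may find quality unacceptable"
    if mos >= 2.6:
        return "users dissatisfied, may retry or open a ticket"
    return "awful, user needs an alternative network"


if __name__ == "__main__":
    for score in (4.5, 4.0, 3.6, 3.0, 2.0):
        print(score, "->", describe_mos(score))
```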

The ITU P.800 standard for MOS defines scoring for narrowband calls. The P.800 methodology is based on having approximately 30 or more people, sitting in a quiet location, listen to eight to 10 seconds of speech under controlled conditions. The listeners are asked to rate their opinions of the calls from very satisfied to awful, scoring the calls from 5 to 1.
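
As the name suggests, the resulting score is the arithmetic mean of the individual listener ratings. The sketch below illustrates only that averaging step; the function name and the sample ratings are made up for the example, and the sketch does not reproduce the controlled test conditions that P.800 requires.

```python
def mean_opinion_score(ratings: list[int]) -> float:
    """Average a panel's 1-to-5 ratings into a single MOS value."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError(f"rating out of range: {rating!r}")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Hypothetical ratings from a small listening panel.
    panel = [5, 4, 4, 3, 4, 4, 5, 3]
    print(f"MOS = {mean_opinion_score(panel):.2f}")
```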

A newer standard, called the E-model and defined in ITU-T Rec. G.107, uses measurements of transmission parameters to calculate a quality score, called the Transmission Rating Factor, or R-factor. R-factor scores may be converted to MOS. R-factor is especially useful for measuring improvements in voice quality when using wideband audio codecs.
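
For illustration, the conversion from R-factor to MOS can be sketched with the narrowband mapping published in G.107. The function name is hypothetical, and the final clamp to 1.0 is an addition made here because the polynomial dips slightly below 1 for very small R values.

```python
def r_factor_to_mos(r: float) -> float:
    """Convert an E-model R-factor to an estimated MOS (narrowband mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
    # Clamp: for very small positive R the polynomial falls just under 1.
    return max(1.0, mos)


if __name__ == "__main__":
    for r in (93.2, 80, 70, 50):
        print(f"R = {r:>5}: MOS ~ {r_factor_to_mos(r):.2f}")
```

Under this mapping, an R-factor of about 93 corresponds to a MOS of roughly 4.4, and an R-factor of 70 to roughly 3.6.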

Today, VoIP management platforms use an automated approach, based on ITU standard P.862, to calculate MOS and R-factor from measured latency, jitter and use of wideband audio, for both live calls and synthetic transactions that simulate call performance. This approach gives those responsible for VoIP quality management not only historical call quality performance data, but also real-time alerts when performance scores drop below an acceptable range. VoIP management platforms are typically able to identify the culprit causing a poor-performing call, giving IT support teams the ability to quickly address and rectify performance-related issues.
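
The alerting behavior can be pictured as a rolling average compared with a threshold. The following is a simplified sketch, not a description of how any particular management platform works; the class name, the window size and the 3.5 threshold (borrowed from the dissatisfaction point mentioned earlier) are assumptions.

```python
from collections import deque


class MosAlertMonitor:
    """Flag an alert when the rolling average MOS drops below a threshold."""

    def __init__(self, threshold: float = 3.5, window: int = 5) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        self.samples: deque[float] = deque(maxlen=window)

    def add_sample(self, mos: float) -> bool:
        """Record one MOS sample; return True if an alert should fire."""
        self.samples.append(mos)
        # Wait until the window is full to avoid alerting on a single sample.
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        return window_full and average < self.threshold


if __name__ == "__main__":
    monitor = MosAlertMonitor(threshold=3.5, window=3)
    for sample in (4.2, 3.0, 3.0, 3.0):
        print(sample, "alert" if monitor.add_sample(sample) else "ok")
```

Averaging over a window rather than alerting on each sample is one possible design choice; it trades a short detection delay for fewer alerts caused by momentary dips.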
