Chapter 05

Voice Authentication


Forensic speaker comparison assesses whether a questioned voice recording (an extortion call, a threat, a kidnapping ransom demand) was produced by a known suspect. The technique combines acoustic measurement (formants, pitch, voice quality) with linguistic analysis (dialect, idiolect) and reports the strength of the evidence as a likelihood ratio.
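As a minimal sketch of how a likelihood ratio is formed, suppose a comparison yields a single similarity score and that scores from same-speaker and different-speaker pairs are each modelled as Gaussian. The function names and the means and standard deviations below are hypothetical placeholders for illustration, not calibrated values from any casework system.

```python
import math

def gaussian_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio(score, same=(2.0, 1.0), diff=(-2.0, 1.0)):
    """LR = p(score | same speaker) / p(score | different speakers).

    `same` and `diff` are hypothetical (mean, std) pairs for illustration.
    """
    return gaussian_pdf(score, *same) / gaussian_pdf(score, *diff)

lr = likelihood_ratio(1.0)
print(f"LR = {lr:.1f}, log10 LR = {math.log10(lr):.2f}")
```

An LR above 1 supports the same-speaker hypothesis, an LR below 1 supports the different-speaker hypothesis, and an LR of exactly 1 means the evidence does not discriminate between them.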

5.1 The Source-Filter Model

Human speech is described by the source-filter model: the larynx (vocal folds) produces a periodic voiced source at fundamental frequency F0; the vocal tract (throat, mouth, nasal cavity) acts as an acoustic filter that selectively reinforces certain frequencies — the formants.

[Figure: SOURCE (larynx; F0 = pitch) → FILTER (vocal tract; F1, F2, F3 formants) → OUTPUT (speech signal)]
Fig 5.1 Source-filter model: laryngeal periodicity × vocal-tract resonance = speech.
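The source-filter idea can be sketched as a toy synthesiser: an impulse train stands in for the laryngeal source, and a cascade of two-pole resonators stands in for the vocal-tract filter. The sample rate, the 100 Hz bandwidth, and all function names are assumptions made for illustration; real vocal-tract models are considerably more detailed.

```python
import cmath
import math

FS = 16000  # sample rate in Hz (assumed)

def resonator(signal, freq, bandwidth, fs=FS):
    """Two-pole resonator: models one formant of the vocal-tract filter."""
    r = math.exp(-math.pi * bandwidth / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def synthesize_vowel(f0=100, formants=(730, 1090, 2440), seconds=0.5, fs=FS):
    """Source (impulse train at f0) passed through cascaded formant resonators."""
    period = fs // f0
    n = int(seconds * fs)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq in formants:
        signal = resonator(signal, freq, bandwidth=100)
    return signal

def magnitude_at(signal, freq, fs=FS):
    """Magnitude of the signal's projection onto a single frequency."""
    return abs(sum(x * cmath.exp(-2j * math.pi * freq * i / fs)
                   for i, x in enumerate(signal)))

vowel = synthesize_vowel()
steady = vowel[-3200:]  # last 0.2 s, after the start-up transient
# The harmonic near F1 (700 Hz) is reinforced relative to one between F2 and F3.
print(magnitude_at(steady, 700) > magnitude_at(steady, 1800))
```

Changing `f0` alters the spacing of the harmonics while leaving the reinforced regions where they are; changing `formants` does the opposite. That independence is what the model is meant to capture.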

5.2 Spectrogram Reading

A spectrogram is a 2-D plot with time on the x-axis, frequency on the y-axis, and intensity shown by colour or brightness.

[Figure: spectrogram with time on the horizontal axis and frequency on the vertical axis; formant bands labelled F1 ~730 Hz, F2 ~1090 Hz, F3 ~2440 Hz, F4 ~3500 Hz; vertical striations = F0 periodicity]
Fig 5.2 Spectrogram of an adult-male /a/ vowel. Dark horizontal bands = formants; vertical striations = F0 periodicity.
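A spectrogram is built by slicing the signal into short overlapping frames and taking the magnitude spectrum of each. The sketch below uses a naive DFT and a Hann window; the frame length, hop size, and two-tone test signal are assumptions chosen so the result is easy to check by eye.

```python
import cmath
import math

def spectrogram(signal, frame_len=128, hop=64):
    """Return one list of bin magnitudes per frame (rows = time, columns = frequency)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    twiddle = [cmath.exp(-2j * math.pi * k / frame_len) for k in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        mags = [abs(sum(chunk[n] * twiddle[(k * n) % frame_len] for n in range(frame_len)))
                for k in range(frame_len // 2 + 1)]
        frames.append(mags)
    return frames

def tone(freq, n, fs):
    """n samples of a sine wave at freq Hz."""
    return [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]

fs = 8000
signal = tone(500, 1024, fs) + tone(1500, 1024, fs)  # 500 Hz, then 1500 Hz
spec = spectrogram(signal)
# Strongest frequency in each frame, converted from bin index to Hz.
peak_hz = [max(range(len(m)), key=m.__getitem__) * fs / 128 for m in spec]
print(peak_hz[0], peak_hz[-1])
```

Plotting `spec` as an image, with frames along the x-axis and bins along the y-axis, gives the familiar picture: here a bright band at 500 Hz that jumps to 1500 Hz halfway through.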

5.3 F0 and Formant Reference Values

Speaker | F0 (Hz) | F1 /a/ (Hz) | F2 /a/ (Hz) | F3 /a/ (Hz)
Adult male | 85–180 (mean ~110–130) | ~730 | ~1090 | ~2440
Adult female | 165–255 (mean ~200–220) | ~850 | ~1220 | ~2810
Children | > 250 | higher | higher | higher

5.4 Pitch-Shift Disguise: Detection

Pitch-shift software can raise an offender's F0 to disguise the voice (for example, a male speaker imitating a female one). Its weakness: basic pitch-shift software raises F0 but does not shift the formants.

A real female speaker has both a higher F0 (~200 Hz) and higher formants (F1 ~850, F2 ~1220, F3 ~2810 Hz), because the female vocal tract is on average shorter. A male speaker pitch-shifted to an F0 of 200 Hz still has male formants (F1 ~730, F2 ~1090, F3 ~2440 Hz) because the vocal-tract length has not changed. The mismatch is detectable on careful spectrogram analysis.
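The consistency check described here can be sketched as a small rule: find which reference profile the measured F0 fits, find which profile the measured formants are closest to, and flag a disagreement. The reference numbers come from the table in section 5.3; the function names and the relative-error distance are assumptions, and a real examination would use many vowels and allow for speaker variation.

```python
# Reference /a/ values from the table in section 5.3 (Hz).
REFERENCE = {
    "male": {"f0": (85, 180), "formants": (730, 1090, 2440)},
    "female": {"f0": (165, 255), "formants": (850, 1220, 2810)},
}

def nearest_formant_profile(formants):
    """Profile whose F1..F3 are closest to the measurement (summed relative error)."""
    def distance(ref):
        return sum(abs(m - r) / r for m, r in zip(formants, ref))
    return min(REFERENCE, key=lambda sex: distance(REFERENCE[sex]["formants"]))

def f0_profiles(f0):
    """Profiles whose F0 range contains the measured F0."""
    return [sex for sex, ref in REFERENCE.items() if ref["f0"][0] <= f0 <= ref["f0"][1]]

def pitch_shift_suspected(f0, formants):
    """Flag a recording whose F0 fits one profile while its formants fit another."""
    matches = f0_profiles(f0)
    return bool(matches) and nearest_formant_profile(formants) not in matches

print(pitch_shift_suspected(200, (730, 1090, 2440)))  # female-range F0, male formants
print(pitch_shift_suspected(200, (850, 1220, 2810)))  # consistent female voice
```

An F0 in the overlap of the two ranges (165–180 Hz) matches both profiles and is never flagged by this rule, which is one reason it can only be a screening heuristic.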

5.5 Cepstrum and MFCCs

The cepstrum (the inverse Fourier transform of the log-magnitude spectrum) separates the source (F0) from the filter (formants). High-quefrency peaks correspond to source periodicity (F0); low-quefrency cepstral coefficients correspond to vocal-tract structure.
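A minimal sketch of cepstral pitch estimation follows. It builds a synthetic voiced frame from harmonics of a known F0 shaped by a toy spectral envelope, computes the real cepstrum with a naive DFT, and reads F0 from the high-quefrency peak. The sample rate, frame length, envelope shape, and the 70–400 Hz search range are all assumptions chosen for illustration.

```python
import cmath
import math

FS, N, F0 = 8000, 512, 125  # sample rate, frame length, true pitch (assumed values)

def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform; adequate for one short frame."""
    n = len(values)
    sign = 1 if inverse else -1
    twiddle = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(v * twiddle[(k * i) % n] for i, v in enumerate(values)) for k in range(n)]
    return [x / n for x in out] if inverse else out

def envelope(freq):
    """Toy vocal-tract filter: a single broad resonance near 700 Hz."""
    return 1.0 / (1.0 + ((freq - 700.0) / 300.0) ** 2)

# Voiced frame: harmonics of F0 (source) weighted by the envelope (filter).
frame = [sum(envelope(h * F0) * math.cos(2 * math.pi * h * F0 * i / FS)
             for h in range(1, 32)) for i in range(N)]

# Real cepstrum: inverse DFT of the log-magnitude spectrum.
log_spectrum = [math.log(abs(x) + 1e-6) for x in dft(frame)]
cepstrum = [c.real for c in dft(log_spectrum, inverse=True)]

# Search only quefrencies that correspond to plausible pitch (70-400 Hz).
lo, hi = FS // 400, FS // 70
peak = max(range(lo, hi + 1), key=lambda q: cepstrum[q])
print(f"estimated F0 = {FS / peak:.1f} Hz")
```

The peak index is a quefrency in samples, so dividing the sample rate by it gives the pitch in Hz. The cepstrum values below `lo` carry the smooth envelope information instead, which is the part MFCC-style features keep.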

MFCCs (Mel-Frequency Cepstral Coefficients) are the cepstrum computed on a perceptually weighted mel-scale frequency axis. A feature vector of 13 coefficients, or 39 once delta and delta-delta terms are appended, is the standard input feature for automatic speech recognition and automatic speaker recognition.
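The mel axis compresses high frequencies relative to low ones. The sketch below shows only that warping step, using one widely used mel formula (other variants exist), and places filter centres evenly on the mel axis. The filter count and the 300–3400 Hz limits are assumptions; the windowing, filterbank energies, and cosine transform of a full MFCC pipeline are not shown.

```python
import math

def hz_to_mel(hz):
    """Hz to mel, using one common formula; other variants exist."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centres(n_filters=10, f_min=300.0, f_max=3400.0):
    """Centre frequencies (Hz) of filters spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centres = mel_centres()
print([round(c) for c in centres])
```

The printed centres are evenly spaced in mel but grow further apart in Hz as frequency rises, which gives the low-frequency region (where F1 and F2 sit) finer resolution.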

5.6 GSM Telephone Effects

Forensic voice work routinely involves GSM mobile recordings. The classical GSM Full Rate codec samples at 8 kHz (Nyquist 4 kHz) and band-limits to ~300–3400 Hz. F1 and F2 typically survive; F3 may be at the band edge; F4 is lost. Comparison protocol: band-limit the high-quality reference recording to the same 300–3400 Hz range before formant measurement.
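The band-limiting step of the comparison protocol can be sketched with an idealised brick-wall filter: transform the reference, zero every bin outside 300–3400 Hz, and transform back. This is an assumption-laden illustration; a real workflow would use a properly designed filter, and codec distortion beyond band-limiting is not modelled. The two-tone test signal and function names are hypothetical.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform; adequate for a short demo signal."""
    n = len(values)
    sign = 1 if inverse else -1
    twiddle = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(v * twiddle[(k * i) % n] for i, v in enumerate(values)) for k in range(n)]
    return [x / n for x in out] if inverse else out

def band_limit(signal, fs, low=300.0, high=3400.0):
    """Brick-wall band-pass: zero every DFT bin outside [low, high] Hz."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        freq = min(k, n - k) * fs / n  # bin frequency, folding negative frequencies
        if not low <= freq <= high:
            spectrum[k] = 0.0
    return [x.real for x in dft(spectrum, inverse=True)]

def level(signal, freq, fs):
    """Amplitude of one frequency component in the signal."""
    return 2.0 * abs(sum(x * cmath.exp(-2j * math.pi * freq * i / fs)
                         for i, x in enumerate(signal))) / len(signal)

fs, n = 16000, 512
# Hypothetical reference: one in-band tone (1000 Hz) and one out-of-band tone (5000 Hz).
reference = [math.sin(2 * math.pi * 1000 * i / fs) + math.sin(2 * math.pi * 5000 * i / fs)
             for i in range(n)]
limited = band_limit(reference, fs)
print(round(level(limited, 1000, fs), 3), round(level(limited, 5000, fs), 3))
```

After band-limiting, the in-band component keeps its level while the out-of-band component is removed, so formants measured on the reference are no longer helped by information the telephone recording never contained.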

5.7 Whispered Speech

Whispered speech has no F0 (turbulent airflow source, not vocal-fold vibration). F0-based measures don't apply; formants, voice quality, prosody, and lexical features still discriminate speakers.

5.8 Automatic Speaker Recognition Generations

Era | Technology | EER (clean) | EER (noisy)
2000s | GMM-UBM | 5–10% | 15%+
2010s | i-vector + PLDA | 2–5% | 8–12%
2017–2020 | x-vector (DNN embedding) | 1–2% | 5–8%
2020+ | ECAPA-TDNN + self-supervised | < 1% | 3–5%
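The EER (equal error rate) in the table is the operating point at which the false-acceptance rate equals the false-rejection rate. A minimal sketch of how it can be estimated from trial scores is shown below; the scores are hypothetical, and the simple threshold sweep is an assumption (evaluation toolkits typically interpolate between points).

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from same-speaker (genuine) and different-speaker (impostor) scores.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where the two error rates are closest.
    """
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < threshold for s in genuine) / len(genuine)     # false rejects
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2.0, threshold)
    return best[1], best[2]

genuine = [0.9, 0.8, 0.7, 0.4]   # hypothetical same-speaker scores
impostor = [0.6, 0.3, 0.2, 0.1]  # hypothetical different-speaker scores
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.0%} at threshold {threshold}")
```

With these toy scores, one genuine trial falls below the threshold and one impostor trial falls above it, so both error rates are one in four. Real evaluations use thousands of trials, which is what makes sub-1% figures measurable.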

5.9 Synthetic-Voice Detection (ASVspoof)

The ASVspoof challenge series develops detection systems for synthetic-voice attacks (TTS, voice cloning, replay). Detection relies on cues such as unnatural breathing patterns, over-regular prosody, spectral artefacts, and frame-by-frame F0 / formant inconsistencies, together with ML classifiers trained on real-vs-synthetic pairs. Detection accuracy on contemporaneous synthesis methods is ~99%, but it degrades on novel methods.

Memory hooks · Chapter 5

Source-filter: larynx (F0) + vocal tract (formants) = speech.
Spectrogram: dark horizontal bands = formants; vertical striations = F0.
F0: male 85–180 Hz, female 165–255 Hz, child > 250 Hz.
Pitch-shift disguise: F0 raised, formants not → inconsistent.
Cepstrum + MFCCs: separate source from filter.
GSM: 300–3400 Hz; F4 lost; band-limit the reference to match.
Whispered: no F0; formants + voice quality + prosody still work.
Automatic speaker recognition generations: GMM-UBM → i-vector → x-vector → ECAPA-TDNN.
