Speaker Comparison: Methods and the Expert's Role

Forensic speaker comparison evaluates whether a known speaker and an unknown voice in a questioned recording share a common source, using phonetic-auditory analysis, acoustic-parametric measurement, and automatic speaker recognition systems within a likelihood-ratio evidential framework.

Last updated: 19 Jun 2026

Forensic speaker comparison determines whether a questioned voice recording and reference recordings of a named individual share a common source. Examiners apply three complementary methods: phonetic-auditory analysis, acoustic-parametric measurement, and automatic speaker recognition using i-vector or x-vector architectures. Conclusions are expressed as a likelihood ratio rather than a categorical identity statement, indicating how much more probable the observed acoustic evidence is under a same-speaker hypothesis than a different-speaker hypothesis drawn from a relevant population. No "voice fingerprint" exists in a forensic sense; results are always qualified by recording quality, channel conditions, and the degree of vocal disguise or stress present.

A threatening phone call, a ransom demand, an intercepted conversation: whenever a voice in a questioned recording may belong to a known individual, the question of speaker identity is on the table. Forensic speaker comparison is the discipline that answers it, and it is considerably more complex, and more cautious, than popular accounts suggest. There is no voice 'fingerprint' in the sense of a unique, invariant identifier. Voices are shaped by anatomy but also by health, emotion, disguise, recording conditions, and the channel through which the sound travels. A forensic comparison must account for all of it.

Three methodological approaches are in current use, and they are not mutually exclusive. The phonetic-auditory approach applies the trained ear and phonological knowledge of an experienced forensic phonetician to compare pronunciation patterns, voice quality, and prosody. The acoustic-parametric approach quantifies measurable features such as formant frequencies, fundamental frequency statistics, and voice onset time, then compares them statistically. Automatic speaker recognition systems, built on i-vector (statistical) or x-vector (deep neural network) representations, compute a similarity score algorithmically, which reduces exposure to human listener effects.

All three approaches now converge on a common evidential framework: the likelihood ratio. Rather than asserting 'this is the same speaker', a well-conducted forensic comparison expresses how much more (or less) probable the observed acoustic data is under a same-speaker hypothesis compared with a different-speaker hypothesis drawn from a relevant population. Each approach has distinct strengths, failure modes, and disclosure requirements that a competent expert must address in any court-bound report.

By the end of this topic you will be able to:

Distinguish the phonetic-auditory, acoustic-parametric, and automatic speaker recognition approaches and state the conditions under which each is most appropriate.
Explain the likelihood-ratio framework for speaker comparison, including the role of the reference population and the verbal scale required by IAFPA guidelines.
Describe how i-vector and x-vector systems work, including the PLDA back-end, and identify the limitations that must be disclosed in forensic casework.
Identify how disguise, stress, telephone bandwidth, and temporal mismatch each degrade comparability and affect the strength of conclusions.
Distinguish a forensic speaker comparison from a witness voice-identification lineup and explain why conflating the two is a methodological error.

Key terms

Forensic speaker comparison: A systematic examination comparing acoustic and phonetic features of a questioned voice recording against known reference recordings of a named individual, expressed in probabilistic or likelihood-ratio terms, not as a categorical identity statement.
Formant frequencies: Resonance frequencies of the vocal tract that shape vowel quality. F1 and F2 (the first and second formants) are the most informative for speaker comparison because they reflect both vocal tract anatomy and learned articulation habits.
Likelihood ratio (LR): The ratio of the probability of the observed evidence given the same-speaker hypothesis to the probability given the different-speaker hypothesis. An LR of 10 means the evidence is 10 times more likely under same-speaker than different-speaker; an LR of 0.1 means it is 10 times more likely under different-speaker.
I-vector / x-vector: Fixed-length mathematical representations of a speech utterance used in automatic speaker recognition. I-vectors are derived from Gaussian mixture model statistics; x-vectors are embeddings learned by a deep neural network. Both are scored by a PLDA back-end.
PLDA (Probabilistic Linear Discriminant Analysis): A statistical back-end model used with i-vector and x-vector systems to compute a similarity score between two utterance representations, normalised for within-speaker variability and between-speaker variability in the training population.
IAFPA (International Association for Forensic Phonetics and Acoustics): The primary professional body for forensic phoneticians and audio analysts, which publishes guidelines on speaker comparison methodology, reporting standards, and competency requirements.

The phonetic-auditory approach

The phonetic-auditory approach predates automated tools by decades and remains valuable in cases where audio quality is too poor for reliable acoustic measurement or where the linguistic content of the recordings matters as much as the acoustic signal. A forensic phonetician listens to both the questioned and reference recordings, attending to four main domains: voice quality (breathiness, creakiness, nasality, laryngeal settings), segmental features (how specific consonants and vowels are produced: the position of the tongue for /r/, the degree of aspiration on /t/, the colouring of vowel formants), prosodic features (habitual pitch range, speech rate, rhythm), and regional and social accent markers.

The auditory approach is trained and systematic; it is not simple subjective impression. Practitioners work from established phonetic frameworks, and the conclusions they draw are supported by reference to specific time-aligned features in the recordings. The vulnerability of the approach is listener bias: if an examiner knows, even unconsciously, which recordings are 'supposed to' match, the evaluation may be influenced. Blind evaluation protocols and independent review by a second examiner are the standard mitigations.

Acoustic-parametric analysis: formants and prosody

Acoustic-parametric analysis takes measurable quantities from the speech signal and subjects them to statistical comparison. The most widely used features in speaker comparison are formant frequencies, particularly F1 and F2 of vowels, which reflect the shape of the vocal tract and tend to be stable across sessions for a given speaker while showing consistent between-speaker differences. Fundamental frequency (F0, the pitch) statistics, long-term average spectra, and voice onset time are also used.

The Bayesian likelihood-ratio approach as formalised by Philip Rose (2002) and Geoffrey Morrison (2009) provides the statistical architecture. Formant measurements from the questioned and reference recordings are modelled as draws from speaker-specific distributions, and the LR is computed as the probability of observing those measurements from the same speaker divided by the probability from a different speaker drawn from a relevant population database. The choice of that population (which speakers represent 'different speakers'?) is consequential and must be justified in the report.
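
The ratio described above can be sketched for a single formant measurement. This is a deliberately simplified illustration, not the model used by Rose or Morrison: it assumes one feature, normal distributions, and known parameters. The function name gaussian_lr and every number in the example call are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def gaussian_lr(x, suspect_mean, within_sd, pop_mean, between_sd):
    """Likelihood ratio for one questioned measurement x (e.g. an F2 value in Hz).

    Numerator: density of x if it came from the suspect, modelled as a normal
    distribution around the suspect's reference mean with within-speaker spread.
    Denominator: density of x if it came from a random speaker in the relevant
    population, whose spread combines between- and within-speaker variance.
    """
    same_speaker = NormalDist(suspect_mean, within_sd).pdf(x)
    population_sd = sqrt(between_sd ** 2 + within_sd ** 2)
    different_speaker = NormalDist(pop_mean, population_sd).pdf(x)
    return same_speaker / different_speaker


# Hypothetical values in Hz, chosen only to show the direction of the effect.
lr_close = gaussian_lr(705.0, suspect_mean=700.0, within_sd=30.0,
                       pop_mean=650.0, between_sd=60.0)
lr_far = gaussian_lr(600.0, suspect_mean=700.0, within_sd=30.0,
                     pop_mean=650.0, between_sd=60.0)
```

In this sketch a questioned value near the suspect's mean and away from the population mean yields an LR above 1, while a value far from the suspect's mean yields an LR below 1. Changing pop_mean or between_sd changes the result, which is why the choice of reference population has to be justified.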

Figure: Likelihood-ratio speaker comparison framework.

Automatic speaker recognition: i-vectors and x-vectors

Modern automatic speaker recognition systems used in forensic contexts are built on two main architectures. I-vectors, introduced by Dehak and colleagues around 2011 (Dehak at MIT CSAIL, Kenny and co-authors at CRIM, Montreal), represent a speech utterance as a fixed-length vector derived from the statistics of a Gaussian mixture model trained on a large corpus of speech. The i-vector captures speaker-specific statistics in a low-dimensional space. Two i-vectors from the same speaker tend to be closer together in this space than two from different speakers.

X-vectors, proposed by Snyder and colleagues in 2018, replace the Gaussian mixture model with a time-delay neural network that learns discriminative features from thousands of hours of labelled training speech. X-vectors generally outperform i-vectors on standard evaluation benchmarks, particularly for short-duration speech. Both architectures are typically paired with a PLDA (Probabilistic Linear Discriminant Analysis) back-end that scores a pair of utterance representations while accounting for within-speaker variability and intersession channel effects; the resulting score is then calibrated so that it can be interpreted as an LR.
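
The step from an embedding comparison to an LR can be sketched as follows. This is a minimal illustration, not PLDA: cosine similarity stands in for the back-end score, and an affine (linear logistic style) mapping stands in for the calibration model. The function names and the scale and offset parameters are hypothetical; in practice they would be fitted on validation data that reflects the case conditions.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two fixed-length utterance embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def calibrated_log10_lr(score, scale, offset):
    """Map a raw score to a log10 LR with an affine calibration.

    Assumes the natural-log LR is scale * score + offset, then converts
    to base 10. scale and offset are placeholders for fitted values.
    """
    return (scale * score + offset) / math.log(10)


# Toy three-dimensional vectors; real embeddings have hundreds of dimensions.
questioned = [0.2, 0.9, -0.1]
reference = [0.25, 0.8, 0.0]
score = cosine_score(questioned, reference)
log10_lr = calibrated_log10_lr(score, scale=8.0, offset=-4.0)
```

With a positive scale, a higher score always maps to a higher LR, so calibration changes the magnitude that is reported rather than the ranking of comparisons.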

Conditions affecting vocal output: disguise, stress, and bandwidth

Speaker comparison is valid only when both the questioned and reference recordings adequately represent the speaker's natural voice under comparable conditions. Several factors degrade this comparability:

Disguise: falsetto, whisper, pitch-shifting, accent imitation, and electronic voice changers can all alter formant patterns, fundamental frequency, and voice quality. Some disguises affect only one or two features; others are pervasive. The examiner must assess and document the degree of disguise and its effect on comparability.
Stress and emotion: high stress raises fundamental frequency and can affect articulation clarity. Recordings of threatening calls or confrontational situations often show elevated F0 relative to neutral reference recordings. This reduces comparability and must be disclosed.
Telephone and codec bandwidth: a narrow-band telephone recording (300-3400 Hz) removes the spectral information above 3400 Hz that some voice-quality and articulation features rely on. Formant comparison is still possible but some features are unavailable. Narrowband vs. wideband reference recordings should not be mixed without careful normalisation.
Temporal mismatch: voice changes with age, illness, and deliberate vocal training. Reference recordings made years before the questioned recording may not represent the same vocal state. The examiner should note any significant temporal gap.
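
The bandwidth point above can be illustrated with a small helper that finds the frequency band two recordings have in common, which is the band to which both would be restricted before measurement. This is a sketch only; the function name shared_band and the wideband limits in the example are assumptions, not values from any guideline.

```python
def shared_band(questioned_hz, reference_hz):
    """Return the (low, high) frequency band in Hz present in both recordings.

    Each argument is a (low, high) tuple describing a recording's usable
    bandwidth. Comparison features should be measured inside the overlap.
    """
    low = max(questioned_hz[0], reference_hz[0])
    high = min(questioned_hz[1], reference_hz[1])
    if low >= high:
        raise ValueError("recordings share no usable frequency band")
    return (low, high)


# Narrowband telephone call against a hypothetical 50-8000 Hz wideband recording.
common = shared_band((300, 3400), (50, 8000))
```

Restricting both recordings to the overlap discards information from the wideband recording, which is one reason the mismatch still has to be disclosed.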

Figure: Four recording conditions and which acoustic features each disrupts: formants, fundamental frequency, voice quality, and automated system reliability.

IAFPA guidelines and the strength-of-evidence scale

The IAFPA guidelines on forensic speaker comparison (Code of Practice, 2020) require that conclusions be expressed using a verbal scale tied to likelihood ratios, not as categorical assertions. The standard scale used by many English-language practitioners runs from 'very strong support for the proposition that the voices come from different speakers' through 'inconclusive' to 'very strong support for the proposition that the voices come from the same speaker', with intermediate gradations of 'limited', 'moderate', and 'strong' support on each side.

The IAFPA guidelines also specify that the report must state the methods used, the reference population chosen for LR computation, any factors that degraded the comparison (disguise, noise, bandwidth mismatch, short duration), and the qualifications of the examiner. Stating a conclusion without this context is considered inadequate.

LR range	Verbal equivalent	Interpretation (H1 = same speaker, H2 = different speakers)
LR < 0.01	Very strong support for different speakers	Evidence more than 100x more probable under H2
0.01 <= LR < 0.1	Strong support for different speakers	Evidence 10-100x more probable under H2
0.1 <= LR < 1	Limited to moderate support for different speakers	Evidence up to 10x more probable under H2
LR = 1	Inconclusive	Evidence equally probable under both hypotheses
1 < LR <= 10	Limited to moderate support for same speaker	Evidence up to 10x more probable under H1
10 < LR <= 100	Strong support for same speaker	Evidence 10-100x more probable under H1
LR > 100	Very strong support for same speaker	Evidence more than 100x more probable under H1
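
The mapping in the table can be written as a small lookup. This is a sketch that mirrors the ranges shown above and nothing more; the function name verbal_equivalent is hypothetical, and a real report would also state the limiting factors alongside the verbal label.

```python
def verbal_equivalent(lr):
    """Return the verbal label from the table above for a likelihood ratio."""
    if lr <= 0:
        raise ValueError("a likelihood ratio must be positive")
    if lr < 0.01:
        return "Very strong support for different speakers"
    if lr < 0.1:
        return "Strong support for different speakers"
    if lr < 1:
        return "Limited to moderate support for different speakers"
    if lr == 1:
        return "Inconclusive"
    if lr <= 10:
        return "Limited to moderate support for same speaker"
    if lr <= 100:
        return "Strong support for same speaker"
    return "Very strong support for same speaker"


label = verbal_equivalent(35.0)
```

The boundary handling follows the table exactly: an LR of exactly 10 falls in the 'limited to moderate' band for the same speaker, and an LR of exactly 0.1 falls in the 'limited to moderate' band for different speakers.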

Worked example

Comparing a ransom call to a suspect's police interview recording

A real comparison with real constraints: noise, narrow bandwidth, and only 90 seconds of useful speech.

A ransom call is recorded on a landline telephone. The caller speaks for approximately 90 seconds; the recording is narrowband (300-3400 Hz) and shows moderate background noise from what sounds like a vehicle interior. A suspect's voice is available from a 20-minute police interview recorded three months later on a wideband digital recorder.

Comparability assessment. The examiner notes: (a) bandwidth mismatch (questioned = narrowband, reference = wideband); (b) stress: the ransom call shows elevated F0 consistent with arousal, whereas the interview is conversational; (c) 90 seconds of questioned speech is short but usable for formant analysis on frequently occurring vowels; (d) no disguise indicators are present in either recording.
Phonetic-auditory pass. The examiner notes that both voices share a British English accent with features consistent with the South West region, including a specific rhoticity pattern in post-vocalic /r/ and a fronted GOAT vowel. No inconsistencies are noted, though the stress-induced pitch elevation in the questioned recording prevents direct F0 comparison.
Acoustic-parametric analysis. Formant measurements are extracted for five vowel classes common to both recordings: the short front vowel in 'cat', the short central vowel in 'cup', the long back vowel in 'car', the short back vowel in 'cot', and the long back rounded vowel in 'thought'. Both recordings are bandpass-filtered to 300-3400 Hz for consistency before measurement. An LR is computed using a British English reference population database.
Automatic system. An x-vector PLDA system trained on an English-language corpus is applied. The system scores the comparison after bandwidth normalisation. The output is converted to an LR using the system's calibration model.
Combined conclusion. The three approaches each contribute a range of LR support. The examiner reports the highest-confidence component (the formant LR) with a verbal conclusion of 'moderate support for the same speaker' and notes the limiting factors: short duration, bandwidth mismatch, and elevated stress in the questioned recording. The two supporting lines of evidence (auditory and automatic) are consistent with this conclusion but are not combined arithmetically.

Key Takeaways

Forensic speaker comparison uses three complementary approaches: phonetic-auditory (trained listener), acoustic-parametric (formant and prosodic measurement), and automatic speaker recognition (i-vector or x-vector PLDA systems).
The likelihood-ratio framework, formalised by Rose (2002) and Morrison (2009), replaces categorical identity verdicts with probabilistic expressions of evidential support, grounded in a relevant reference population.
I-vectors and x-vectors are the dominant automatic speaker recognition architectures; x-vectors use deep neural networks and generally outperform i-vectors on short utterances, but both require a PLDA back-end and accurate calibration.
Disguise, stress, telephone bandwidth, and temporal mismatch all reduce comparability and must be disclosed; an examiner's conclusions should be qualified by the quality of the comparison, not presented at face value.
IAFPA guidelines require verbal-scale conclusions, method disclosure, and reference-population justification; conflating laboratory speaker comparison with a witness voice-identification lineup is a fundamental methodological error.

What is the likelihood-ratio framework in forensic speaker comparison?

The likelihood-ratio (LR) framework expresses the evidential value of a speaker comparison as the ratio of two probabilities: how likely the observed acoustic measurements are if the questioned and reference voices come from the same speaker, divided by how likely they are if the voices come from different speakers drawn from a relevant population. An LR greater than 1 supports the same-speaker hypothesis; an LR less than 1 supports different speakers. The framework is probabilistic, not categorical.

What are i-vectors and x-vectors in automatic speaker recognition?

I-vectors (identity vectors, introduced around 2011 by Dehak et al.) are compact fixed-length representations of a speech utterance derived from a probabilistic model of speaker variability. X-vectors are speaker embeddings learned by a deep neural network (proposed by Snyder et al., 2018). Both are combined with a PLDA (Probabilistic Linear Discriminant Analysis) back-end that computes a score comparing two utterances. These systems form the backbone of modern automatic speaker recognition and are now part of the forensic toolkit.

What does IAFPA say about speaker comparison reporting?

The International Association for Forensic Phonetics and Acoustics (IAFPA) guidelines require that speaker comparison reports express conclusions using a verbal scale of likelihood ratios (from 'very strong support' for the same speaker to 'very strong support' for different speakers), not categorical statements such as 'the voices are the same person'. The guidelines also require disclosure of the comparison method, the reference population used, and any factors that reduced the quality of the comparison.

How does vocal disguise affect speaker comparison?

Deliberate disguise (falsetto, whisper, accent imitation, or electronically altered pitch) can significantly degrade both automatic and human-expert comparison accuracy. Formant patterns remain partially consistent under some disguises but can be substantially altered under others. Examiners must assess and disclose the degree of disguise present in a recording and adjust their confidence accordingly; high levels of disguise may make a meaningful comparison impossible.

What is the difference between speaker comparison and a voice identification lineup?

Forensic speaker comparison is an expert acoustic and phonetic analysis conducted in a laboratory, comparing known reference recordings against a questioned recording, and expressed as a likelihood ratio. A voice identification lineup is an investigative procedure in which a witness listens to several voices and attempts to identify which one they heard, more analogous to a visual lineup. The two serve different purposes and have different validity standards.
