Voice Analysis: Spectrography, Speaker Identification and Legal Aspects

Voice analysis: vocal apparatus, spectrograms, MFCC and ASR, Ritesh Sinha 2019, Selvi 2010, BSA 2023 and limitations.

Forensic voice analysis uses acoustic features of recorded speech to determine speaker identity, answering "is the speaker in recording A the same person as the speaker in recording B?" rather than transcribing words. The core tools are the voice spectrogram (which plots time, frequency, and energy to reveal formants and harmonics), Mel-Frequency Cepstral Coefficients (MFCCs) for automatic speaker modelling, and a likelihood-ratio framework for reporting results. Under Indian law, the Supreme Court in Ritesh Sinha v. State of UP (2019) confirmed that a Judicial Magistrate may order an accused to provide a voice sample without violating the right against self-incrimination, and forensic voice analysis findings are admissible as expert opinion under Section 39 of the Bharatiya Sakshya Adhiniyam 2023.

Forensic voice analysis sits alongside fingerprints, track marks, and biometric systems as a means of linking a person to a recorded event. The four pillars of the subject are: the anatomy of the human voice apparatus, how a voice spectrogram is read, how a forensic examiner moves from a questioned recording to a speaker-identification opinion, and the Indian legal framework on voice-sample collection and admissibility under the Bharatiya Sakshya Adhiniyam 2023.

The technical foundation is the source-filter model of voice production, which explains fundamental frequency and formants. The legal dimension covers Selvi 2010, Ritesh Sinha 2019, the lawful-interception framework, and the limitations that defence counsel routinely raises in voice cases. Forensic voice analysis is not speech recognition, which transcribes words; it answers the speaker question: "is this the same person?"

By the end of this topic you will be able to:

  • Describe the three-block structure of the human voice apparatus (air supply, phonation, resonance/articulation) and explain how the source-filter model links them to formants F1-F4.
  • Read a voice spectrogram and distinguish wideband from narrowband analysis in terms of what each reveals about formants and harmonics.
  • Compare the four forensic speaker-recognition method families (aural-spectrographic, aural-perceptual, acoustic-phonetic, automatic) and explain why modern practice reports results as a likelihood ratio.
  • State the holdings of Selvi v. State of Karnataka (2010) and Ritesh Sinha v. State of UP (2019) and identify the relevant provisions of the BSA 2023 and the lawful-interception statutes.
  • List the principal technical and practical limitations that can weaken or invalidate a forensic voice comparison.
Key terms
Fundamental frequency (F0)
Rate of vocal-fold vibration during voiced speech, perceived as pitch. Typical adult male 80 to 180 Hz, adult female 165 to 255 Hz, children above 250 Hz.
Formants (F1, F2, F3, F4)
Resonant frequencies of the vocal tract that shape vowel quality. F1 and F2 together distinguish vowels (for example /i/ has low F1 and high F2, /a/ has high F1).
Source-filter model
Standard model of voice production: the glottal source (vocal-fold vibration) is filtered by the vocal-tract cavity to produce the speech signal.
Voice spectrogram (sonogram)
Time-frequency plot of speech. Wideband (about 300 Hz analysis bandwidth) shows formants; narrowband (about 45 Hz) shows harmonics.
Voiceprint
Popular but misleading term coined by Lawrence Kersta in 1962 for an aural-spectrographic comparison. The fingerprint analogy is not scientifically established.
MFCC
Mel-Frequency Cepstral Coefficients. The dominant acoustic feature set for automatic speaker recognition.
FASR
Forensic Automatic Speaker Recognition. Likelihood-ratio framework using statistical speaker models (GMM, i-vectors, x-vectors).
Likelihood Ratio (LR)
Bayesian measure of how much more probable the evidence is under the same-speaker hypothesis than under the different-speaker hypothesis, reported with verbal scales (extremely strong, strong, moderate, equivocal).

Introduction and significance

Forensic voice analysis is the use of acoustic features of recorded speech to address forensic questions, most often "is the speaker in recording A the same person as the speaker in recording B?". It is distinct from speech recognition, which transcribes words. For MCQs, fix the framing: voice analysis is a speaker problem, speech recognition is a content problem.

The subject covers four principal use cases. Speaker identification on ransom, threat and harassment calls links an anonymous voice to a suspect. Authentication of recorded conversations checks whether an audio file is original or edited, the question raised in many political audio leaks. Speaker profiling extracts gender, approximate age, dialect and emotional state when no suspect is yet on the radar. Disguise detection flags whisper, falsetto, pitch shifters and voice-changer apps. Voice sits alongside fingerprints and biometric systems because it is a biometric, but it is a behavioural one, which is why it is more variable than fingerprint or iris evidence.

Structure of the human voice apparatus

The voice apparatus has three functional blocks.

  • Air supply: the lungs and diaphragm push air upward to provide the subglottal pressure that drives speech.
  • Phonation: the larynx houses the vocal folds, held apart for unvoiced sounds (/s/, /f/) and brought together to vibrate at the fundamental frequency F0 for voiced sounds. Pitch depends on vocal-fold tension and length, and on subglottal pressure. Adult male F0 is typically 80 to 180 Hz, adult female 165 to 255 Hz, children above 250 Hz.
  • Resonance and articulation: the vocal tract above the larynx (pharynx, oral cavity, nasal cavity) acts as a filter, and the articulators (tongue, lips, jaw, soft palate or velum, teeth) move continuously to change the filter.
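
To make F0 concrete, the sketch below estimates it from a synthetic voiced frame by finding the lag at which the signal best matches a shifted copy of itself (autocorrelation). The function name `estimate_f0`, the 60 to 400 Hz search range and the synthetic test frame are illustrative assumptions, not a method prescribed by any forensic standard; real casework uses more robust pitch trackers.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) as sample_rate / lag of the strongest autocorrelation peak."""
    lag_min = int(sample_rate / f0_max)   # shortest period considered
    lag_max = int(sample_rate / f0_min)   # longest period considered
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation at this lag
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic 40 ms "voiced" frame: a 125 Hz fundamental plus two weaker harmonics
sample_rate = 8000
true_f0 = 125.0
frame = [
    sum(math.sin(2 * math.pi * true_f0 * k * n / sample_rate) / k for k in (1, 2, 3))
    for n in range(320)
]

f0 = estimate_f0(frame, sample_rate)
print(f"estimated F0: {f0:.1f} Hz")
```

On this clean synthetic frame the estimate falls inside the typical adult male range quoted above; on real speech, noise and octave errors make the problem considerably harder.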

The source-filter model combines these blocks. A glottal source spectrum is multiplied by the vocal-tract transfer function to give the radiated speech spectrum at the lips. The peaks of the transfer function are the formants F1, F2, F3 and F4. F1 and F2 together distinguish vowels: /i/ as in "see" has low F1 and high F2, /a/ as in "father" has high F1 and lower F2. Formant patterns are shaped by the speaker's anatomy (tract length, palate shape) and so carry speaker-specific information.

Air from lungs drives vocal-fold vibration at F0 in the larynx; the vocal-tract cavity filters the source spectrum, and the articulators reshape the filter to produce vowels and consonants.
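
The link between tract length and formants can be illustrated with the textbook idealisation of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / 4L. This is a sketch under that assumption only: real vocal tracts are not uniform, the speed of sound (350 m/s here) is an assumed round value, and the two tract lengths are illustrative.

```python
def tube_formants(tract_length_m, n_formants=4, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end: (2n - 1) * c / (4 * L)."""
    return [
        (2 * n - 1) * speed_of_sound / (4 * tract_length_m)
        for n in range(1, n_formants + 1)
    ]

# Illustrative tract lengths: a longer and a shorter vocal tract
longer_tract = tube_formants(0.175)
shorter_tract = tube_formants(0.145)

print([round(f) for f in longer_tract])   # neutral-vowel formants for a 17.5 cm tube
print([round(f) for f in shorter_tract])  # every formant shifts upward
```

The shorter tube raises every formant, which is the anatomical effect that makes formant patterns partly speaker-specific.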

Voice spectrography

A voice spectrogram (sonogram) plots time on the x-axis, frequency on the y-axis, and energy as grayscale or colour intensity. Two analysis settings matter. The wideband spectrogram (analysis bandwidth about 300 Hz) has better time resolution and shows dark horizontal formant bands plus vertical striations from individual glottal pulses during voiced speech. The narrowband spectrogram (about 45 Hz) has better frequency resolution and shows individual harmonics of F0. Examiners use wideband for formant tracking and narrowband for pitch and intonation work.
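
The trade-off between the two settings follows from the rough rule that effective window duration is about 1 / bandwidth. The sketch below applies that rule of thumb (the exact constant depends on the window shape, so treat the numbers as approximate) to an assumed F0 of 120 Hz; the helper name `describe` is hypothetical.

```python
def analysis_window_ms(bandwidth_hz):
    """Rule of thumb: effective window duration is roughly 1 / bandwidth."""
    return 1000.0 / bandwidth_hz

def describe(bandwidth_hz, f0_hz):
    window_ms = analysis_window_ms(bandwidth_hz)
    period_ms = 1000.0 / f0_hz  # duration of one glottal cycle
    return {
        "window_ms": round(window_ms, 1),
        # Harmonics sit F0 apart, so they separate only if the filter is narrower than F0
        "resolves_harmonics": bandwidth_hz < f0_hz,
        # Individual glottal pulses show only if the window is shorter than one cycle
        "resolves_glottal_pulses": window_ms < period_ms,
    }

wideband = describe(300.0, f0_hz=120.0)
narrowband = describe(45.0, f0_hz=120.0)
print("wideband:", wideband)
print("narrowband:", narrowband)
```

With these assumptions the wideband window is shorter than one glottal cycle (hence vertical striations), while the narrowband window spans several cycles (hence separate harmonics).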

The term voiceprint was coined by Bell Labs engineer Lawrence Kersta in a 1962 paper suggesting visual spectrogram comparisons were as individuating as fingerprints. The analogy stuck in the press but was never established by independent science. The 1979 US National Research Council report criticised the aural-spectrographic method, concluding that forensic applications should be approached with great caution. The FBI continued using spectrographic voice comparison as investigative guidance after that report, and courts progressively rejected voiceprint testimony on admissibility grounds. For exam purposes, "voiceprint" is a misnomer, and modern forensic practice prefers likelihood-ratio reporting from automatic systems.

Speaker recognition approaches and the MFCC pipeline

Forensic speaker recognition splits into four method families. (1) Aural-spectrographic (the historical voiceprint method) combines listening with visual spectrogram comparison; subjective, abandoned by FBI and most modern labs. (2) Aural-perceptual uses a trained listener to compare recordings by ear. (3) Acoustic-phonetic uses a phonetician to measure specific features (F0 distribution, formant frequencies and bandwidths, vowel-space areas, articulation rate). (4) Automatic Speaker Recognition (ASR), called FASR in the forensic frame, uses software to extract a feature vector and a statistical speaker model. Modern systems use MFCC features and Gaussian Mixture Models (GMM), i-vectors or x-vectors (deep-learning embeddings), with results reported as a likelihood ratio.

The MFCC pipeline is the standard exam diagram: pre-emphasis → framing (20 to 30 ms frames with 10 ms hop) → windowing (Hamming) → FFT → mel filterbank → log → DCT, giving 12 to 13 cepstral coefficients per frame, usually augmented with delta and delta-delta coefficients.
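
The pipeline stages can be sketched in plain Python as below. This is a minimal teaching sketch, not a production extractor: it uses a naive DFT in place of an FFT, assumes a 0.97 pre-emphasis coefficient, 25 ms frames, 20 mel filters and the common 2595 * log10(1 + f/700) mel formula, and omits delta and delta-delta coefficients. All function names are illustrative.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (a real system would use an FFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))) ** 2 / n
        for k in range(n // 2 + 1)
    ]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [i * top / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        weights = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):          # rising edge
            weights[k] = (k - left) / (centre - left)
        for k in range(centre, right):         # falling edge
            weights[k] = (right - k) / (right - centre)
        bank.append(weights)
    return bank

def mfcc(signal, sample_rate, frame_ms=25, hop_ms=10, n_filters=20, n_ceps=13):
    # 1. Pre-emphasis boosts high frequencies
    emphasised = [signal[0]] + [
        signal[i] - 0.97 * signal[i - 1] for i in range(1, len(signal))
    ]
    # 2. Framing and 3. Hamming windowing
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(emphasised) - frame_len + 1, hop):
        frame = [emphasised[start + n] * window[n] for n in range(frame_len)]
        power = power_spectrum(frame)                      # 4. spectrum
        log_energies = [                                   # 5. mel filterbank, 6. log
            math.log(max(sum(w * p for w, p in zip(weights, power)), 1e-10))
            for weights in bank
        ]
        features.append([                                  # 7. DCT-II
            sum(e * math.cos(math.pi * m * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for m in range(n_ceps)
        ])
    return features

# 100 ms synthetic voiced signal at 8 kHz (150 Hz fundamental plus harmonics)
sample_rate = 8000
signal = [sum(math.sin(2 * math.pi * 150 * k * n / sample_rate) / k for k in (1, 2, 3))
          for n in range(800)]
features = mfcc(signal, sample_rate)
print(len(features), "frames x", len(features[0]), "coefficients")
```

Each row of `features` is one frame's cepstral vector; a speaker model (GMM, i-vector or x-vector system) would be trained on many such rows.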

Three speaker-task terms are frequently conflated and are worth distinguishing precisely. Speaker verification (1:1) asks "is this the claimed speaker?", used in banking voice biometrics. Speaker identification, closed-set (1:N) asks "which of N enrolled speakers is this?". Speaker identification, open-set asks "is this one of N enrolled speakers, or someone else?", which is the realistic forensic case.
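
The three tasks differ only in the decision rule applied to similarity scores, which the sketch below makes explicit. The speaker labels, scores and threshold are invented for illustration, and the function names are hypothetical rather than taken from any particular toolkit.

```python
def verify(score, threshold):
    """Speaker verification (1:1): accept or reject one claimed identity."""
    return score >= threshold

def identify_closed_set(scores):
    """Closed-set identification (1:N): the best-scoring enrolled speaker always wins."""
    return max(scores, key=scores.get)

def identify_open_set(scores, threshold):
    """Open-set identification: best enrolled speaker, or None if nobody scores high enough."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None

# Hypothetical similarity scores of a questioned recording against enrolled speakers
scores = {"speaker_A": 0.41, "speaker_B": 0.58, "speaker_C": 0.12}
threshold = 0.70

claimed_ok = verify(scores["speaker_B"], threshold)
closed = identify_closed_set(scores)
opened = identify_open_set(scores, threshold)
print(claimed_ok, closed, opened)
```

Closed-set identification names speaker_B even though the score is weak, while the open-set rule returns nobody, which is why the open-set framing matters when the true speaker may not be among the suspects.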

Questioned and known recordings are framed and feature-extracted to MFCC vectors, modelled with a speaker statistical model, then compared and reported as a likelihood ratio with a verbal-scale conclusion.

Modern reporting avoids categorical "identification" claims and uses verbal-scale LR statements, for example "the evidence provides strong support for the same-speaker hypothesis". Standards come from IAFPA (International Association for Forensic Phonetics and Acoustics) for ethics, ENFSI speaker-comparison guidelines for the LR framework, and ISO/IEC 19794-13 for the speech-data interchange format. Tools used in Indian and global labs include Praat (free, University of Amsterdam), Batvox (Agnitio), Phonexia Speaker Plus and ASIS-V. Commercial voice biometrics like HSBC Voice ID share the underlying pipeline but are tuned for verification, not forensic comparison.
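
The likelihood ratio itself is a simple quotient: the probability density of the observed comparison score under the same-speaker hypothesis divided by its density under the different-speaker hypothesis. The sketch below assumes both score distributions are Gaussian with invented parameters, and the cut-offs in `verbal_scale` are placeholders chosen for illustration, not the thresholds of any published verbal scale.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def likelihood_ratio(score, same_mean, same_std, diff_mean, diff_std):
    """LR = p(score | same speaker) / p(score | different speakers)."""
    return gaussian_pdf(score, same_mean, same_std) / gaussian_pdf(score, diff_mean, diff_std)

def verbal_scale(lr):
    """Map an LR to a verbal statement (placeholder cut-offs, assumed for illustration)."""
    magnitude = abs(math.log10(lr))
    if magnitude >= 4:
        strength = "extremely strong"
    elif magnitude >= 2:
        strength = "strong"
    elif magnitude >= 1:
        strength = "moderate"
    else:
        return "equivocal"
    hypothesis = "same-speaker" if lr > 1 else "different-speaker"
    return f"{strength} support for the {hypothesis} hypothesis"

# Hypothetical calibration: same-speaker scores ~ N(2, 1), different-speaker scores ~ N(-2, 1)
lr = likelihood_ratio(1.5, same_mean=2.0, same_std=1.0, diff_mean=-2.0, diff_std=1.0)
print(f"LR = {lr:.1f}: {verbal_scale(lr)}")
```

In casework the two distributions are estimated from a relevant reference population recorded under matching conditions, which is where most of the practical difficulty lies.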

Limitations

  • Recording quality and channel mismatch. Telephone audio is bandlimited to about 4 kHz; broadband audio reaches 8 to 22 kHz. GSM and modern codecs throw away high frequencies. Comparing a telephone questioned recording with a broadband exemplar biases the comparison.
  • Speaker variation. Emotional state, illness, intoxication, time of day and age change a voice. Intra-speaker variability can exceed inter-speaker variability for similar-sounding speakers.
  • Disguise. Pitch shifting, falsetto, whispered speech, accent imitation and consumer voice-changer apps reduce the strength of any same-speaker conclusion.
  • Short utterance length. In forensic practice, less than about 30 seconds of speech gives an unreliable comparison; ransom calls are often too short.
  • Cross-language samples. Comparing an English questioned recording with a Hindi or regional-language exemplar is problematic because acoustic distributions and reference models change with language.
  • Aural-spectrographic limits. The historical voiceprint method is not as individuating as fingerprints; the 1979 NRC report ended FBI voiceprint testimony.
  • Deepfake and synthetic voice. Voice-cloning systems trained on a few seconds of target audio now produce convincing fakes; analysts check for synthesis artefacts as a routine step.
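
Several of these limitations can be screened for before any acoustic comparison is attempted. The sketch below is a hypothetical pre-comparison checklist: the `Recording` fields, the 500 Hz bandwidth tolerance and the function name are assumptions made for illustration, and only the 30-second figure comes from the list above. It flags conditions for the examiner to weigh and does not replace expert judgement.

```python
from dataclasses import dataclass

@dataclass
class Recording:
    speech_seconds: float   # net speech duration
    bandwidth_hz: float     # usable audio bandwidth
    language: str

def comparison_caveats(questioned, known, min_speech_s=30.0, bandwidth_tol_hz=500.0):
    """Return the limitations that apply to this pair of recordings."""
    caveats = []
    if min(questioned.speech_seconds, known.speech_seconds) < min_speech_s:
        caveats.append("short utterance")
    if abs(questioned.bandwidth_hz - known.bandwidth_hz) > bandwidth_tol_hz:
        caveats.append("channel mismatch")
    if questioned.language != known.language:
        caveats.append("cross-language")
    return caveats

# Hypothetical case: a short telephone ransom call versus a long broadband exemplar
ransom_call = Recording(speech_seconds=12.0, bandwidth_hz=4000.0, language="English")
exemplar = Recording(speech_seconds=120.0, bandwidth_hz=8000.0, language="Hindi")

caveats = comparison_caveats(ransom_call, exemplar)
print(caveats)
```

Speaker variation, disguise and synthetic speech cannot be reduced to metadata checks like these and need acoustic or perceptual analysis.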

Practical Indian context that has brought voice evidence into public view includes bank-call frauds, ransom calls (phone-tap recordings used during the Nirbhaya investigation phase), and politically charged audio leaks such as the Niira Radia tapes (2010), which raised procedural and admissibility questions more than identity disputes. Each case reinforces the point that voice evidence rises and falls on chain-of-custody and lawful-interception paperwork as much as on the acoustic analysis.

Is forensic voice analysis the same as speech recognition?
No. Speech recognition transcribes words (a content question). Forensic voice analysis identifies the speaker or extracts speaker information (a who-said-it question). Both use acoustic features and signal-processing tools like MFCCs, but they answer different questions. For MCQs, the distinction is examinable: pick speaker identification, not transcription, as the forensic application.
What did the Supreme Court hold in Ritesh Sinha v. State of UP (2019) about voice samples?
A Judicial Magistrate has the power to direct an accused to give a voice sample for investigation, even without an express statutory provision at the time. The court drew the analogy with fingerprints and handwriting under Section 73 of the Indian Evidence Act (now Section 72 of the BSA 2023) and held that taking a voice exemplar does not violate the right against self-incrimination under Article 20(3). Section 349 of the BNSS 2023 has since given Magistrates an express power to order voice samples. Voice samples obtained on such orders are admissible.
What is the difference between a wideband and a narrowband voice spectrogram?
A wideband spectrogram uses about 300 Hz analysis bandwidth; it gives good time resolution and clearly shows formants and vertical striations from individual glottal pulses. A narrowband spectrogram uses about 45 Hz bandwidth; it gives good frequency resolution and shows individual harmonics of F0. Forensic examiners use wideband for formant tracking and narrowband for pitch and intonation analysis.
Why is the term 'voiceprint' considered misleading?
Lawrence Kersta coined 'voiceprint' in 1962 with an implied fingerprint analogy, suggesting that a visual spectrogram comparison could individuate speakers in the way a fingerprint does. That uniqueness claim was never scientifically established. The 1979 US National Research Council report criticised the aural-spectrographic method and the FBI subsequently stopped giving voiceprint testimony. Modern forensic practice uses likelihood-ratio reporting from automatic systems (MFCC features plus GMM, i-vector or x-vector models).
What are the main limitations of forensic voice analysis that examiners test?
Four limitations form the standard short-answer set. First, recording quality and channel mismatch: telephone audio is bandlimited to about 4 kHz and codecs like GSM destroy high frequencies. Second, speaker variation: emotional state, illness, intoxication and age change a voice. Third, disguise: pitch shifting, falsetto and voice-changer apps reduce comparison strength. Fourth, short utterance length: in forensic practice, less than about 30 seconds of speech gives unreliable results. Cross-language samples, the limits of the historical voiceprint method, and emerging deepfake voice cloning extend this list.
