Voice Analysis: Spectrography, Speaker Identification and Legal Aspects
Voice analysis: vocal apparatus, spectrograms, MFCC and ASR, Ritesh Sinha 2019, Selvi 2010, BSA 2023 and limitations.
Forensic voice analysis uses acoustic features of recorded speech to determine speaker identity, answering "is the speaker in recording A the same person as recording B?" rather than transcribing words. The core tools are the voice spectrogram (which plots time, frequency, and energy to reveal formants and harmonics), Mel-Frequency Cepstral Coefficients (MFCCs) for automatic speaker modelling, and a likelihood-ratio framework for reporting results. Under Indian law, the Supreme Court in Ritesh Sinha v. State of UP (2019) confirmed that a Judicial Magistrate may order an accused to provide a voice sample without violating the right against self-incrimination, and forensic voice analysis findings are admissible as expert opinion under Section 39 of the Bharatiya Sakshya Adhiniyam 2023.
Forensic voice analysis sits alongside fingerprints, track marks, and biometric systems as a means of linking a person to a recorded event. The four pillars of the subject are: the anatomy of the human voice apparatus, how a voice spectrogram is read, how a forensic examiner moves from a questioned recording to a speaker-identification opinion, and the Indian legal framework on voice-sample collection and admissibility under the Bharatiya Sakshya Adhiniyam 2023.
The technical foundation is the source-filter model of voice production, which explains fundamental frequency and formants. The legal dimension covers Selvi 2010, Ritesh Sinha 2019, the lawful-interception framework, and the limitations that defence counsel routinely raises in voice cases. Forensic voice analysis is not speech recognition, which transcribes words; it answers the speaker question: "is this the same person?"
By the end of this topic you will be able to:
- Describe the three-block structure of the human voice apparatus (air supply, phonation, resonance/articulation) and explain how the source-filter model links them to formants F1-F4.
- Read a voice spectrogram and distinguish wideband from narrowband analysis in terms of what each reveals about formants and harmonics.
- Compare the four forensic speaker-recognition method families (aural-spectrographic, aural-perceptual, acoustic-phonetic, automatic) and explain why modern practice reports results as a likelihood ratio.
- State the holdings of Selvi v. State of Karnataka (2010) and Ritesh Sinha v. State of UP (2019) and identify the relevant provisions of the BSA 2023 and the lawful-interception statutes.
- List the principal technical and practical limitations that can weaken or invalidate a forensic voice comparison.
- Fundamental frequency (F0): Rate of vocal-fold vibration during voiced speech, perceived as pitch. Typical adult male 80 to 180 Hz, adult female 165 to 255 Hz, children above 250 Hz.
- Formants (F1, F2, F3, F4): Resonant frequencies of the vocal tract that shape vowel quality. F1 and F2 together distinguish vowels (for example /i/ has low F1 and high F2, /a/ has high F1).
- Source-filter model: Standard model of voice production: the glottal source (vocal-fold vibration) is filtered by the vocal-tract cavity to produce the speech signal.
- Voice spectrogram (sonogram): Time-frequency plot of speech. Wideband (about 300 Hz analysis bandwidth) shows formants; narrowband (about 45 Hz) shows harmonics.
- Voiceprint: Popular but misleading term coined by Lawrence Kersta in 1962 for an aural-spectrographic comparison. The fingerprint analogy is not scientifically established.
- MFCC: Mel-Frequency Cepstral Coefficients. The dominant acoustic feature set for automatic speaker recognition.
- FASR: Forensic Automatic Speaker Recognition. Likelihood-ratio framework using statistical speaker models (GMM, i-vectors, x-vectors).
- Likelihood Ratio (LR): Bayesian measure of how much more probable the evidence is under the same-speaker hypothesis than under the different-speaker hypothesis, reported with verbal scales (extremely strong, strong, moderate, equivocal).
Introduction and significance
Forensic voice analysis is the use of acoustic features of recorded speech to address forensic questions, most often "is the speaker in recording A the same person as the speaker in recording B?". It is distinct from speech recognition, which transcribes words. For MCQs, fix the framing: voice analysis is a speaker problem, speech recognition is a content problem.
The subject covers four principal use cases. Speaker identification on ransom, threat and harassment calls links an anonymous voice to a suspect. Authentication of recorded conversations checks whether an audio file is original or edited, the question raised in many political audio leaks. Speaker profiling extracts gender, approximate age, dialect and emotional state when no suspect is yet on the radar. Disguise detection flags whisper, falsetto, pitch shifters and voice-changer apps. Voice sits alongside fingerprints and biometric systems because it is a behavioural biometric, which is also why it is more variable than fingerprint or iris evidence.
Structure of the human voice apparatus
The voice apparatus has three functional blocks.
- Air supply: the lungs and diaphragm push air upward to provide the subglottal pressure that drives speech.
- Phonation: the larynx houses the vocal folds, held apart for unvoiced sounds (/s/, /f/) and brought together to vibrate at the fundamental frequency F0 for voiced sounds. Pitch depends on vocal-fold tension and length, and on subglottal pressure. Adult male F0 is typically 80 to 180 Hz, adult female 165 to 255 Hz, children above 250 Hz.
- Resonance and articulation: the vocal tract above the larynx (pharynx, oral cavity, nasal cavity) acts as a filter, and the articulators (tongue, lips, jaw, soft palate or velum, teeth) move continuously to change the filter.
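One common way to estimate F0 from a voiced frame is to find the lag at which the frame best matches a shifted copy of itself (autocorrelation). The sketch below is illustrative only: the function name `estimate_f0`, the 60 to 400 Hz search range and the synthetic 100 Hz test tone are assumptions for the example, not the procedure of any particular forensic tool.

```python
import math

def estimate_f0(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate F0 (Hz) of a voiced frame from its autocorrelation peak."""
    min_lag = int(sample_rate / f_max)   # shortest period searched
    max_lag = int(sample_rate / f_min)   # longest period searched
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the frame and itself shifted by `lag` samples.
        value = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic stand-in for a voiced frame: 50 ms of a 100 Hz tone at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100.0 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(tone, sample_rate)
print(f"Estimated F0: {f0:.1f} Hz")
```

Real speech is noisier than a pure tone, so practical pitch trackers add voicing detection and smoothing across frames on top of this basic idea.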
The source-filter model combines these blocks. A glottal source spectrum is multiplied by the vocal-tract transfer function to give the radiated speech spectrum at the lips. The peaks of the transfer function are the formants F1, F2, F3 and F4. F1 and F2 together distinguish vowels: /i/ as in "see" has low F1 and high F2, /a/ as in "father" has high F1 and lower F2. Formant patterns are shaped by the speaker's anatomy (tract length, palate shape) and so carry speaker-specific information.
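The multiplication at the heart of the source-filter model can be sketched numerically. In the example below, every number is an assumption chosen for illustration: the source is a harmonic series with amplitudes falling as 1/k², each formant is modelled as a simple second-order resonance, and the formant centres and bandwidths are rough values for an /a/-like vowel.

```python
import math

def resonance_gain(f, centre_hz, bandwidth_hz):
    """Magnitude response of one second-order resonance (a single formant)."""
    ratio = f / centre_hz
    damping = bandwidth_hz * f / centre_hz ** 2
    return 1.0 / math.sqrt((1.0 - ratio ** 2) ** 2 + damping ** 2)

def radiated_harmonics(f0, formants, max_hz=4000.0):
    """Return (frequency, amplitude) per harmonic: source times filter gain."""
    harmonics = []
    k = 1
    while k * f0 <= max_hz:
        f = k * f0
        source = 1.0 / k ** 2            # assumed glottal spectral tilt
        tract = 1.0
        for centre, bandwidth in formants:
            tract *= resonance_gain(f, centre, bandwidth)
        harmonics.append((f, source * tract))
        k += 1
    return harmonics

# Illustrative (centre, bandwidth) pairs in Hz; not measured data.
formants = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0), (3300.0, 150.0)]
spectrum = radiated_harmonics(120.0, formants)

# Among the harmonics around F1, the one closest to 700 Hz is boosted most.
near_f1 = [(f, a) for f, a in spectrum if 500.0 <= f <= 900.0]
strongest_near_f1 = max(near_f1, key=lambda pair: pair[1])[0]
print(strongest_near_f1)
```

The harmonics themselves stay at multiples of F0; the vocal-tract filter only changes their relative strength, which is why formants appear as an envelope over the harmonic structure.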

Voice spectrography
A voice spectrogram (sonogram) plots time on the x-axis, frequency on the y-axis, and energy as grayscale or colour intensity. Two analysis settings matter. The wideband spectrogram (analysis bandwidth about 300 Hz) has better time resolution and shows dark horizontal formant bands plus vertical striations from individual glottal pulses during voiced speech. The narrowband spectrogram (about 45 Hz) has better frequency resolution and shows individual harmonics of F0. Examiners use wideband for formant tracking and narrowband for pitch and intonation work.
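The trade-off between the two settings follows from the inverse relationship between analysis bandwidth and window duration. The sketch below uses the rough rule of thumb that window duration is about 1 / bandwidth; the exact constant depends on the window shape, so treat the numbers as approximate. The function names and the 16 kHz sample rate are assumptions for the example.

```python
def window_for_bandwidth(bandwidth_hz, sample_rate_hz):
    """Approximate analysis window (seconds, samples) for a target bandwidth."""
    duration_s = 1.0 / bandwidth_hz      # rule of thumb, window-shape dependent
    return duration_s, round(duration_s * sample_rate_hz)

def resolves_harmonics(bandwidth_hz, f0_hz):
    """Harmonics are spaced F0 apart, so they separate only if bandwidth < F0."""
    return bandwidth_hz < f0_hz

sample_rate = 16000
wide_s, wide_n = window_for_bandwidth(300.0, sample_rate)      # about 3.3 ms
narrow_s, narrow_n = window_for_bandwidth(45.0, sample_rate)   # about 22 ms

print(f"wideband:   {wide_s * 1000:.1f} ms, {wide_n} samples")
print(f"narrowband: {narrow_s * 1000:.1f} ms, {narrow_n} samples")
print(resolves_harmonics(300.0, 120.0), resolves_harmonics(45.0, 120.0))
```

A window of about 3 ms is shorter than one glottal period of a 120 Hz voice (about 8 ms), which is why the wideband display shows individual glottal pulses as vertical striations while blurring the harmonics together.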
The term voiceprint was coined by Bell Labs engineer Lawrence Kersta in a 1962 paper suggesting visual spectrogram comparisons were as individuating as fingerprints. The analogy stuck in the press but was never established by independent science. The 1979 US National Research Council report criticised the aural-spectrographic method, concluding that forensic applications should be approached with great caution. The FBI continued using spectrographic voice comparison as investigative guidance after that report, and courts progressively rejected voiceprint testimony on admissibility grounds. For exam purposes, the correct answer is that "voiceprint" is a misnomer and modern forensic practice prefers likelihood-ratio reporting from automatic systems.
Speaker recognition approaches and the MFCC pipeline
Forensic speaker recognition splits into four method families. (1) Aural-spectrographic (the historical voiceprint method) combines listening with visual spectrogram comparison; it is subjective and has been abandoned by the FBI and most modern labs. (2) Aural-perceptual uses a trained listener to compare recordings by ear. (3) Acoustic-phonetic uses a phonetician to measure specific features (F0 distribution, formant frequencies and bandwidths, vowel-space areas, articulation rate). (4) Automatic Speaker Recognition (ASR), called FASR in the forensic frame, uses software to extract a feature vector and build a statistical speaker model. Modern systems use MFCC features with Gaussian Mixture Models (GMM), i-vectors or x-vectors (deep-learning embeddings), and report results as a likelihood ratio.
The MFCC pipeline is the standard exam diagram: pre-emphasis → framing (20 to 30 ms frames with 10 ms hop) → windowing (Hamming) → FFT → mel filterbank → log → DCT, giving 12 to 13 cepstral coefficients per frame, usually augmented with delta and delta-delta coefficients.
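The pipeline stages can be followed end to end in a minimal standard-library sketch. It is written for readability, not speed: a direct DFT stands in for the FFT, and the 0.97 pre-emphasis coefficient, 26 mel filters, 25 ms frame and synthetic test tone are common illustrative choices assumed here, not values taken from any forensic system. Delta and delta-delta coefficients are omitted.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, frame_len, sample_rate):
    """Triangular filters spaced evenly on the mel scale; one weight per DFT bin."""
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for low, centre, high in zip(edges, edges[1:], edges[2:]):
        row = []
        for k in range(frame_len // 2 + 1):
            f = k * sample_rate / frame_len
            if low < f <= centre:
                row.append((f - low) / (centre - low))
            elif centre < f < high:
                row.append((high - f) / (high - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(signal, sample_rate, frame_ms=25, hop_ms=10, n_filters=26, n_coeffs=13):
    # 1. Pre-emphasis: first-order difference that boosts high frequencies.
    emphasised = [signal[0]] + [signal[n] - 0.97 * signal[n - 1]
                                for n in range(1, len(signal))]
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    # 3. Hamming window, computed once and reused for every frame.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    # 2. Framing: overlapping frames advanced by one hop.
    for start in range(0, len(emphasised) - frame_len + 1, hop):
        frame = [emphasised[start + n] * window[n] for n in range(frame_len)]
        # 4. Power spectrum (direct DFT used here in place of an FFT).
        power = []
        for k in range(frame_len // 2 + 1):
            dft = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            power.append(abs(dft) ** 2 / frame_len)
        # 5 and 6. Mel filterbank energies, then log (floored to avoid log(0)).
        log_energies = [math.log(max(sum(w * p for w, p in zip(row, power)), 1e-10))
                        for row in bank]
        # 7. DCT-II of the log energies; keep the first n_coeffs coefficients.
        features.append([sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                             for m, e in enumerate(log_energies))
                         for c in range(n_coeffs)])
    return features

sample_rate = 8000
tone = [math.sin(2 * math.pi * 440.0 * n / sample_rate) for n in range(800)]
features = mfcc(tone, sample_rate)
print(len(features), "frames x", len(features[0]), "coefficients")
```

Each row of `features` is one frame's coefficient vector; a speaker model (GMM, i-vector or x-vector) is then trained on many such rows rather than on the raw waveform.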
Three speaker-task terms are frequently conflated and are worth distinguishing precisely. Speaker verification (1:1) asks "is this the claimed speaker?", used in banking voice biometrics. Speaker identification, closed-set (1:N) asks "which of N enrolled speakers is this?". Speaker identification, open-set asks "is this one of N enrolled speakers, or someone else?", which is the realistic forensic case.
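The difference between the three tasks comes down to what is done with the same similarity scores. A minimal sketch follows, assuming a hypothetical scoring system that has already produced one similarity score per enrolled speaker; the speaker names, scores and threshold are invented for illustration.

```python
def verify(score, threshold):
    """1:1 verification: accept or reject a single claimed identity."""
    return score >= threshold

def identify_closed_set(scores):
    """1:N closed-set: always return the best-scoring enrolled speaker."""
    return max(scores, key=scores.get)

def identify_open_set(scores, threshold):
    """Open-set: return the best match, or None if nobody scores high enough."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None

# Hypothetical similarity scores of a questioned recording against three speakers.
scores = {"speaker_A": 0.41, "speaker_B": 0.78, "speaker_C": 0.35}

print(verify(scores["speaker_A"], threshold=0.6))   # claimed identity is rejected
print(identify_closed_set(scores))                  # forced choice among enrolled
print(identify_open_set(scores, threshold=0.9))     # may be "none of the above"
```

The closed-set function can never answer "someone else", which is why it is a poor fit for casework where the true speaker may not be among the suspects.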

Modern reporting avoids categorical "identification" claims and uses verbal-scale LR statements, for example "the evidence provides strong support for the same-speaker hypothesis". Standards come from IAFPA (International Association for Forensic Phonetics and Acoustics) for ethics, ENFSI speaker-comparison guidelines for the LR framework, and ISO/IEC 19794-13 for the speech-data interchange format. Tools used in Indian and global labs include Praat (free, University of Amsterdam), Batvox (Agnitio), Phonexia Speaker Plus and ASIS-V. Commercial voice biometrics like HSBC Voice ID share the underlying pipeline but are tuned for verification, not forensic comparison.
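A likelihood ratio of the kind reported in verbal-scale statements can be sketched as the ratio of two probability densities evaluated at the observed comparison score. Everything numeric below is an assumption made for the example: the two Gaussian score models, the observed score, and in particular the log10 cut-offs in `verbal_scale`, which are illustrative and are not the thresholds of ENFSI or any other published scale.

```python
import math
from statistics import NormalDist

def likelihood_ratio(score, same_model, different_model):
    """LR = p(score | same speaker) / p(score | different speakers)."""
    return same_model.pdf(score) / different_model.pdf(score)

def verbal_scale(lr):
    """Map an LR to a verbal statement using illustrative log10 cut-offs."""
    log_lr = math.log10(lr)
    strength = abs(log_lr)
    if strength < 1:
        return "equivocal"
    if strength >= 4:
        label = "extremely strong"
    elif strength >= 2:
        label = "strong"
    else:
        label = "moderate"
    hypothesis = "same-speaker" if log_lr > 0 else "different-speaker"
    return f"{label} support for the {hypothesis} hypothesis"

# Hypothetical score distributions estimated from reference recordings.
same_model = NormalDist(mu=2.0, sigma=1.0)
different_model = NormalDist(mu=-2.0, sigma=1.0)

lr = likelihood_ratio(1.5, same_model, different_model)
print(f"LR = {lr:.1f}: {verbal_scale(lr)}")
```

An LR below 1 supports the different-speaker hypothesis, and an LR near 1 means the evidence barely discriminates between the two, which is what the "equivocal" label expresses.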
Indian legal aspects
Selvi v. State of Karnataka (2010, Supreme Court). Narco-analysis, polygraph and brain-mapping cannot be performed on an accused without consent; involuntary administration violates Article 20(3) (self-incrimination) and Article 21 (privacy). Selvi did not deal with voice samples directly but framed the constitutional limits on compelled physiological evidence.
Ritesh Sinha v. State of Uttar Pradesh (2019, Supreme Court). A Judicial Magistrate has the power to direct an accused to give a voice sample for investigation, even without an express statutory provision at the time. The court drew the analogy with fingerprints and handwriting under Section 73 of the Indian Evidence Act 1872 and Section 311A of the CrPC (the latter now Section 349 of the BNSS 2023, which expressly covers voice samples) and held that taking a voice exemplar does not violate Article 20(3). Voice samples for comparison are admissible. This is the single most important case on the topic.
BSA 2023 Section 39. Expert opinion on matters of "science or art" covers forensic voice analysis. The framework for admissibility of forensic evidence under the BSA 2023 governs how the analyst's report and cross-examination are handled.
Electronic-evidence frame. Audio recordings are electronic records. BSA 2023 Sections 61 to 63 (Sections 62 and 63 carrying forward IEA Sections 65A and 65B) require the certificate covering the device that produced or copied the recording, in keeping with the BNS and BSA 2023 framework for electronic evidence.
Lawful interception. The Indian Telegraph Act 1885 Section 5(2) allows interception on order of the Home Secretary or designated officer in cases of public emergency or public safety. The IT Act 2000 Section 69 extends the regime to electronic communications. PUCL v. Union of India (1997) laid down procedural safeguards (reasons in writing, review committee, two-month review, destruction of records). Without these the recording is liable to be excluded, and the chain of custody for the seized audio is the first thing the defence will probe.
Institutional anchors: CFSL Hyderabad houses a dedicated audio-forensics unit, CFSL Chandigarh has audio-forensics capability, and NFSU Gandhinagar runs an audio-forensics teaching and casework programme. The IIT Madras speech-processing group is a leading academic centre for speaker-recognition research in India.
Limitations
- Recording quality and channel mismatch. Telephone audio is bandlimited to about 4 kHz; broadband audio reaches 8 to 22 kHz. GSM and modern codecs throw away high frequencies. Comparing a telephone questioned recording with a broadband exemplar biases the comparison.
- Speaker variation. Emotional state, illness, intoxication, time of day and age change a voice. Intra-speaker variability can exceed inter-speaker variability for similar-sounding speakers.
- Disguise. Pitch shifting, falsetto, whispered speech, accent imitation and consumer voice-changer apps reduce the strength of any same-speaker conclusion.
- Short utterance length. In forensic practice, less than about 30 seconds of speech gives an unreliable comparison; ransom calls are often too short.
- Cross-language samples. Comparing an English questioned recording with a Hindi or regional-language exemplar is problematic because acoustic distributions and reference models change with language.
- Aural-spectrographic limits. The historical voiceprint method is not as individuating as fingerprints; the 1979 NRC report urged great caution, and courts progressively rejected voiceprint testimony.
- Deepfake and synthetic voice. Voice-cloning systems trained on a few seconds of target audio now produce convincing fakes; analysts check for synthesis artefacts as a routine step.
Practical Indian context that has brought voice evidence into public view includes bank-call frauds, ransom calls (phone-tap recordings used during the Nirbhaya investigation phase), and politically charged audio leaks such as the Niira Radia tapes (2010), which raised procedural and admissibility questions more than identity disputes. Each case reinforces the point that voice evidence rises and falls on chain-of-custody and lawful-interception paperwork as much as on the acoustic analysis.
Is forensic voice analysis the same as speech recognition?
What did the Supreme Court hold in Ritesh Sinha v. State of UP (2019) about voice samples?
What is the difference between a wideband and a narrowband voice spectrogram?
Why is the term 'voiceprint' considered misleading?
What are the main limitations of forensic voice analysis that examiners test?