Voice Analysis: Spectrography, Speaker Identification and Legal Aspects
UGC-NET Paper 2 Unit VIII notes on voice analysis: vocal apparatus, spectrograms, MFCC and ASR, Ritesh Sinha 2019, Selvi 2010, BSA 2023 and limitations.
Voice analysis closes Unit VIII of the UGC-NET Forensic Science syllabus, sitting next to fingerprints, track marks and biometrics. The bullet asks for four blocks of recall: the anatomy of the human voice apparatus, how a voice spectrogram is read, how a forensic examiner moves from a questioned recording to a speaker-identification opinion, and the Indian legal frame on voice-sample collection and admissibility. NTA likes this topic because it threads cleanly into expert-opinion and electronic-evidence questions under the Bharatiya Sakshya Adhiniyam 2023.
Treat the topic as one anatomy diagram plus one short courtroom story. The anatomy carries the source-filter model, fundamental frequency and formants. The courtroom story carries Selvi 2010, Ritesh Sinha 2019, the lawful-interception frame, and the limitations defence counsel raises in every voice case. Forensic voice analysis is not speech recognition (which transcribes words); it answers the speaker question, "is this the same person?".
- Fundamental frequency (F0)
- Rate of vocal-fold vibration during voiced speech, perceived as pitch. Typical adult male 80 to 180 Hz, adult female 165 to 255 Hz, children above 250 Hz.
- Formants (F1, F2, F3, F4)
- Resonant frequencies of the vocal tract that shape vowel quality. F1 and F2 together distinguish vowels (for example /i/ has low F1 and high F2, /a/ has high F1).
- Source-filter model
- Standard model of voice production: the glottal source (vocal-fold vibration) is filtered by the vocal-tract cavity to produce the speech signal.
- Voice spectrogram (sonogram)
- Time-frequency plot of speech. Wideband (about 300 Hz analysis bandwidth) shows formants; narrowband (about 45 Hz) shows harmonics.
- Voiceprint
- Popular but misleading term coined by Lawrence Kersta in 1962 for an aural-spectrographic comparison. The fingerprint analogy is not scientifically established.
- MFCC
- Mel-Frequency Cepstral Coefficients. The dominant acoustic feature set for automatic speaker recognition.
- FASR
- Forensic Automatic Speaker Recognition. Likelihood-ratio framework using statistical speaker models (GMM, i-vectors, x-vectors).
- Likelihood Ratio (LR)
- Bayesian measure of how much more probable the evidence is under the same-speaker hypothesis than under the different-speaker hypothesis, reported with verbal scales (extremely strong, strong, moderate, equivocal).
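The likelihood-ratio entry above can be made concrete with a toy calculation. The sketch below is illustrative only: it assumes the comparison score follows a Gaussian distribution under each hypothesis, and the names `gaussian_pdf`, `likelihood_ratio` and `posterior_odds`, as well as every numeric parameter, are hypothetical rather than taken from any forensic system.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio(score, same_mean, same_std, diff_mean, diff_std):
    """LR = p(score | same speaker) / p(score | different speaker).

    Assumption: both score distributions are Gaussian with known parameters.
    """
    return gaussian_pdf(score, same_mean, same_std) / gaussian_pdf(score, diff_mean, diff_std)

def posterior_odds(prior_odds, lr):
    """Bayes' rule in odds form: posterior odds = prior odds x LR."""
    return prior_odds * lr

# Hypothetical score distributions: same-speaker scores centre on +2,
# different-speaker scores centre on -2, both with unit spread.
lr = likelihood_ratio(score=1.0, same_mean=2.0, same_std=1.0, diff_mean=-2.0, diff_std=1.0)
print(round(lr, 1))
```

An LR above 1 supports the same-speaker hypothesis and an LR below 1 supports the different-speaker hypothesis. The examiner reports the LR; the prior odds, and therefore the posterior, belong to the court.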
Introduction and significance
Voice analysis answers 'who is this speaker?', not 'what did they say?'.
Forensic voice analysis is the use of acoustic features of recorded speech to address forensic questions, most often "is the speaker in recording A the same person as the speaker in recording B?". It is distinct from speech recognition, which transcribes words. For NET MCQs, fix the framing: voice analysis is a speaker problem, speech recognition is a content problem.
The significance set runs to four use cases. Speaker identification on ransom, threat and harassment calls links an anonymous voice to a suspect. Authentication of recorded conversations checks whether an audio file is original or edited, the question raised in many political audio leaks. Speaker profiling extracts gender, approximate age, dialect and emotional state when no suspect is yet on the radar. Disguise detection flags whisper, falsetto, pitch shifters and voice-changer apps. Voice sits alongside fingerprints and biometric systems in Unit VIII because it is a behavioural biometric, which is also why it is more variable than fingerprint or iris evidence.
Structure of the human voice apparatus
Lungs, larynx, vocal tract, articulators.
The voice apparatus has three blocks the syllabus expects you to name in order. Air supply: lungs and diaphragm push air upward to provide the subglottal pressure that drives speech. Phonation: the larynx houses the vocal folds, held apart for unvoiced sounds (/s/, /f/) and brought together to vibrate at the fundamental frequency F0 for voiced sounds. Pitch depends on vocal-fold tension and length, and on subglottal pressure. Adult male F0 is typically 80 to 180 Hz, adult female 165 to 255 Hz, children above 250 Hz. Resonance and articulation: the vocal tract above the larynx (pharynx, oral cavity, nasal cavity) acts as a filter, and the articulators (tongue, lips, jaw, soft palate or velum, teeth) move continuously to change the filter.
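Since F0 is simply the repetition rate of the vocal-fold cycle, it can be estimated from a voiced stretch of audio by looking for the lag at which the waveform best matches a shifted copy of itself. The following is a minimal autocorrelation sketch, not a forensic-grade pitch tracker: the function name `estimate_f0` and the 60 to 400 Hz search range are assumptions, and real speech would also need voicing detection and protection against octave errors.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of a voiced frame by picking the autocorrelation peak.

    Assumption: the frame is voiced and longer than sample_rate / f0_min samples.
    """
    min_lag = int(sample_rate / f0_max)   # shortest period considered
    max_lag = int(sample_rate / f0_min)   # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with itself shifted by `lag` samples.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic "voiced" frame: a 125 Hz tone sampled at 8 kHz (64 samples per cycle).
rate = 8000
frame = [math.sin(2.0 * math.pi * 125.0 * n / rate) for n in range(512)]
f0 = estimate_f0(frame, rate)
print(f0)
```

A result in the 80 to 180 Hz band would be consistent with the typical adult male range quoted above, although F0 alone never identifies a speaker.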
The source-filter model combines these blocks. A glottal source spectrum is multiplied by the vocal-tract transfer function to give the radiated speech spectrum at the lips. The peaks of the transfer function are the formants F1, F2, F3 and F4. F1 and F2 together distinguish vowels: /i/ as in "see" has low F1 and high F2, /a/ as in "father" has high F1 and lower F2. Formant patterns are shaped by the speaker's anatomy (tract length, palate shape) and so carry speaker-specific information.
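The link between tract length and formant position can be illustrated with the textbook idealisation of the vocal tract as a uniform tube, closed at the glottis and open at the lips. This is an assumption added for illustration, not part of the syllabus text: such a tube resonates at odd multiples of c / 4L, which approximates a neutral (schwa-like) vowel only. The function name `tube_formants` and the 350 m/s speed of sound are likewise illustrative choices.

```python
def tube_formants(tract_length_m, n_formants=4, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    F_n = (2n - 1) * c / (4 * L). Idealised: real vowels deform the tube,
    which is what moves F1 and F2 away from these neutral values.
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * tract_length_m)
            for n in range(1, n_formants + 1)]

adult = tube_formants(0.175)    # roughly 17.5 cm tract
shorter = tube_formants(0.145)  # a shorter tract, for comparison

print([round(f) for f in adult])
print([round(f) for f in shorter])
```

The shorter tube gives uniformly higher resonances, which is the simplest way to see why formant patterns carry information about the speaker's anatomy as well as about the vowel.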
Voice spectrography
Wideband shows formants. Narrowband shows harmonics. Voiceprint is a misnomer.
A voice spectrogram (sonogram) plots time on the x-axis, frequency on the y-axis, and energy as grayscale or colour intensity. Two analysis settings matter for NET. The wideband spectrogram (analysis bandwidth about 300 Hz) has better time resolution and shows dark horizontal formant bands plus vertical striations from individual glottal pulses during voiced speech. The narrowband spectrogram (about 45 Hz) has better frequency resolution and shows individual harmonics of F0. Examiners use wideband for formant tracking and narrowband for pitch and intonation work.
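The wideband and narrowband trade-off follows from the analysis window: a short window resolves time well but smears frequency, and a long window does the opposite. As a rough sketch, assuming the common rule of thumb that analysis bandwidth is about the reciprocal of window duration (the exact constant depends on the window shape), the two settings translate into window lengths as follows. The helper name `window_for_bandwidth` and the 16 kHz sample rate are illustrative.

```python
def window_for_bandwidth(bandwidth_hz, sample_rate_hz):
    """Return (duration in ms, length in samples) for a target analysis bandwidth.

    Assumption: bandwidth ~= 1 / window duration, a rule of thumb only.
    """
    duration_s = 1.0 / bandwidth_hz
    return duration_s * 1000.0, round(duration_s * sample_rate_hz)

wide_ms, wide_samples = window_for_bandwidth(300.0, 16000)     # wideband setting
narrow_ms, narrow_samples = window_for_bandwidth(45.0, 16000)  # narrowband setting

print(round(wide_ms, 1), wide_samples)
print(round(narrow_ms, 1), narrow_samples)
```

A window of about 3 ms is shorter than one glottal cycle for most adult voices, which is why wideband displays show individual pulses as vertical striations. A window of about 22 ms spans several cycles, so the harmonics of F0 separate instead.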
The term voiceprint was coined by Bell Labs engineer Lawrence Kersta in a 1962 paper suggesting that visual spectrogram comparisons were as individuating as fingerprints. The analogy stuck in the press but was never established by independent science. The 1979 US National Research Council report criticised the aural-spectrographic method, and the FBI subsequently stopped giving voiceprint testimony. The correct MCQ answer is that "voiceprint" is a misnomer and that modern forensic practice prefers likelihood-ratio reporting from automatic systems.
Speaker recognition approaches and the MFCC pipeline
Aural-spectrographic, aural-perceptual, acoustic-phonetic, automatic.
Forensic speaker recognition splits into four method families. (1) Aural-spectrographic (the historical voiceprint method) combines listening with visual spectrogram comparison; it is subjective and has been abandoned by the FBI and most modern labs. (2) Aural-perceptual uses a trained listener to compare recordings by ear. (3) Acoustic-phonetic uses a phonetician to measure specific features (F0 distribution, formant frequencies and bandwidths, vowel-space areas, articulation rate). (4) Automatic Speaker Recognition (ASR), in the forensic frame FASR, uses software to extract a feature vector and build a statistical speaker model. Note that elsewhere the abbreviation ASR usually means automatic speech recognition, so read it in context. Modern systems use MFCC features with Gaussian Mixture Models (GMM), i-vectors or x-vectors (deep-learning embeddings), and report results as a likelihood ratio.
The MFCC pipeline is the standard exam diagram: pre-emphasis, framing (20 to 30 ms frames with 10 ms hop), windowing (Hamming), FFT, mel filterbank, log, DCT to give 12 to 13 cepstral coefficients per frame, usually augmented with delta and delta-delta coefficients.
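The steps listed above can be sketched end to end in a few lines of NumPy. This is a teaching sketch under stated assumptions, not the implementation used by any particular forensic package: the pre-emphasis coefficient 0.97, the 26-filter mel bank, the 512-point FFT, the 2595 * log10(1 + f/700) mel formula and all function names are conventional but illustrative choices, and delta and delta-delta coefficients are left out. The frame length must not exceed `n_fft`.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc(signal, sample_rate, frame_ms=25.0, hop_ms=10.0,
         n_filters=26, n_ceps=13, pre_emph=0.97, n_fft=512):
    signal = np.asarray(signal, dtype=float)
    # 1. Pre-emphasis: boost high frequencies.
    emphasised = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2. Framing: overlapping frames of frame_ms, advanced by hop_ms.
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(emphasised) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    # 3. Windowing with a Hamming window.
    frames = emphasised[idx] * np.hamming(frame_len)
    # 4. FFT, kept as a power spectrum.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 5. Mel filterbank energies, then 6. log compression.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # 7. DCT-II across the filterbank axis; keep the first n_ceps coefficients.
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    return log_energies @ basis.T

# One second of a 220 Hz tone at 16 kHz stands in for speech here.
t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2.0 * np.pi * 220.0 * t), 16000)
print(features.shape)
```

Each row of the result is one frame and each column one cepstral coefficient, which is the feature matrix a GMM, i-vector or x-vector system would consume.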
The three speaker-task terms NTA mixes up in MCQs are worth pinning down. Speaker verification (1:1) asks "is this the claimed speaker?", used in banking voice biometrics. Speaker identification, closed-set (1:N) asks "which of N enrolled speakers is this?". Speaker identification, open-set asks "is this one of N enrolled speakers, or someone else?", which is the realistic forensic case.
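The difference between the three tasks is easiest to see as three different decisions made on the same similarity score. The sketch below assumes each speaker is already summarised as a fixed-length embedding vector and uses cosine similarity with a hard threshold; the function names, the toy two-dimensional vectors and the thresholds are all hypothetical. Forensic reporting would replace the hard threshold with a likelihood ratio.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def verify(probe, claimed_model, threshold):
    """Speaker verification (1:1): accept or reject a claimed identity."""
    return cosine_similarity(probe, claimed_model) >= threshold

def identify_closed_set(probe, enrolled):
    """Closed-set identification (1:N): always returns the best enrolled match."""
    return max(enrolled, key=lambda name: cosine_similarity(probe, enrolled[name]))

def identify_open_set(probe, enrolled, threshold):
    """Open-set identification: best match, or None if nobody is close enough."""
    best = identify_closed_set(probe, enrolled)
    return best if cosine_similarity(probe, enrolled[best]) >= threshold else None

# Toy enrolled "embeddings" (illustrative values only).
enrolled = {"speaker_A": [1.0, 0.0], "speaker_B": [0.0, 1.0]}
known_probe = [0.9, 0.1]      # close to speaker_A
outsider_probe = [0.7, 0.7]   # equally far from both enrolled speakers

print(verify(known_probe, enrolled["speaker_A"], threshold=0.8))
print(identify_closed_set(known_probe, enrolled))
print(identify_open_set(outsider_probe, enrolled, threshold=0.9))
```

The closed-set function is forced to name someone even for the outsider, which is exactly why the open-set formulation is the realistic one for casework.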
Indian legal aspects
Selvi 2010, Ritesh Sinha 2019, BSA 2023 Section 39, and the lawful-interception frame.
Selvi v. State of Karnataka (2010, Supreme Court). Narco-analysis, polygraph and brain-mapping cannot be performed on an accused without consent; involuntary administration violates Article 20(3) (self-incrimination) and Article 21 (privacy). Selvi did not deal with voice samples directly but framed the constitutional limits on compelled physiological evidence.
Ritesh Sinha v. State of Uttar Pradesh (2019, Supreme Court). A Judicial Magistrate has the power to direct an accused to give a voice sample for investigation, even without an express statutory provision at the time. The court drew the analogy with fingerprints and handwriting under Section 73 of the Indian Evidence Act 1872 (now Section 72 of the BSA 2023) and held that taking a voice exemplar does not violate Article 20(3). The statutory gap has since been closed: Section 349 of the BNSS 2023, the successor to Section 311A CrPC, expressly lists voice samples among the specimens a Magistrate may order. Voice samples for comparison are admissible. This is the single most important MCQ-grade case on the topic.
BSA 2023 Section 39. Expert opinion on matters of "science or art" covers forensic voice analysis. The framework for admissibility of forensic evidence under the BSA 2023 governs how the analyst's report and cross-examination are handled.
Electronic-evidence frame. Audio recordings are electronic records. BSA 2023 Sections 62 and 63 (carrying forward IEA Sections 65A and 65B) require a certificate covering the device that produced or copied the recording, in keeping with the BSA 2023 framework for electronic evidence.
Lawful interception. The Indian Telegraph Act 1885 Section 5(2) allows interception on the order of the Home Secretary or a designated officer in cases of public emergency or in the interest of public safety. The IT Act 2000 Section 69 extends the regime to electronic communications. PUCL v. Union of India (1997) laid down procedural safeguards (reasons in writing, review committee, two-month review, destruction of records). Without these the recording is liable to be excluded.
Limitations
Recording quality, speaker variation, disguise, short utterance.
NTA exploits the long limitations list by asking "name four limitations" in short-answer form.
- Recording quality and channel mismatch. Narrowband telephone audio is bandlimited to below about 4 kHz; broadband audio reaches 8 to 22 kHz. GSM and other narrowband codecs discard high frequencies and add compression artefacts. Comparing a telephone questioned recording with a broadband exemplar biases the comparison.
- Speaker variation. Emotional state, illness, intoxication, time of day and age change a voice. Intra-speaker variability can exceed inter-speaker variability for similar-sounding speakers.
- Disguise. Pitch shifting, falsetto, whispered speech, accent imitation and consumer voice-changer apps reduce the strength of any same-speaker conclusion.
- Short utterance length. Less than about 30 seconds of net speech gives an unreliable comparison; ransom calls are often too short.
- Cross-language samples. Comparing an English questioned recording with a Hindi or regional-language exemplar is problematic because acoustic distributions and reference models change with language.
- Aural-spectrographic limits. The historical voiceprint method is not as individuating as fingerprints; the 1979 NRC report ended FBI voiceprint testimony.
- Deepfake and synthetic voice. Voice-cloning systems trained on a few seconds of target audio now produce convincing fakes; analysts check for synthesis artefacts as a routine step.
Practical Indian context that has brought voice evidence into public view includes bank-call frauds, ransom calls (phone-tap recordings used during the Nirbhaya investigation phase), and politically charged audio leaks such as the Niira Radia tapes (2010), which raised procedural and admissibility questions more than identity disputes. Each case reinforces the point that voice evidence rises and falls on chain-of-custody and lawful-interception paperwork as much as on the acoustic analysis.