Forensic audio enhancement recovers intelligible speech from degraded recordings using noise reduction, adaptive filtering, and bandwidth extension, while strict procedural rules protect the integrity of the original evidence.
A police recording of a threat made over a cheap intercom, a body-worn camera capturing an altercation in a crowded bar, a wiretap interrupted by roadwork outside the window: all of these arrive in the forensic audio laboratory as recordings where the speech is there but not comfortably audible. The goal of forensic audio enhancement is to improve intelligibility without adding anything that was not in the original signal. That last clause is what separates forensic enhancement from broadcast audio production, and it is the hardest constraint to maintain under pressure.
The techniques involved split along the nature of the interference. Stationary noise, the kind that stays spectrally consistent throughout the recording (tape hiss, HVAC rumble, power-supply hum), yields to spectral subtraction and Wiener filtering. Non-stationary interference, noise that changes character moment to moment (a vehicle driving past, a crowd that swells and quietens), requires adaptive filtering strategies that track the interference as it changes. Recordings degraded by bandwidth limitation (a phone call band-limited to 300-3400 Hz, a walkie-talkie with a narrow audio path) pose a third challenge, which bandwidth extension addresses.
Underpinning all of it is an obligation: every processing step must be documented, the original must be preserved unmodified, and the expert must be prepared to explain, in plain language, exactly what was done and what the before-and-after comparison shows. The SWGDE guidelines and ISO standards on audio evidence exist precisely to make that obligation operational rather than aspirational.
The wrong filter on the wrong noise leaves speech worse off than before.
The single most consequential decision in audio enhancement is made before any processing begins: classifying the noise. A spectral subtraction algorithm fed a stationary noise estimate will clean up tape hiss beautifully, but it will do little against a passing siren and may distort the speech underneath it. Starting from the wrong classification wastes time and can introduce processing artefacts that make the recording harder to interpret, not easier.
| Noise type | Spectral character | Recommended approach |
|---|---|---|
| Tape hiss / analogue noise floor | Broadband, spectrally flat, stationary | Spectral subtraction or Wiener filter |
| HVAC or fan rumble | Low-frequency, stationary | High-pass filter, then Wiener filter |
| Mains hum (50/60 Hz) | Narrowband, stationary at harmonics | Notch filter at harmonic frequencies |
| Vehicle pass-by | Non-stationary, sweeping spectral peak | Adaptive filtering or multi-band tracking |
| Crowd / babble noise | Non-stationary, broadband | Adaptive noise cancellation if reference available |
| Wind on microphone | Low-frequency burst, non-stationary | Adaptive filter + high-pass gating |
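
As an illustration of the mains-hum row in the table above, the following is a minimal sketch of a cascade of second-order notch filters placed at the mains frequency and its harmonics. The coefficient formulas follow the widely used audio-EQ-cookbook biquad form; the function names, the default `q` of 10, and the choice of three harmonics are assumptions made for this example, not values taken from any forensic guideline.

```python
import math


def notch_coefficients(f0, fs, q=10.0):
    """Second-order IIR notch centred on f0 (Hz), normalised so a[0] == 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a


def biquad(x, b, a):
    """Run one biquad section over the samples in x (direct form I)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out


def remove_hum(x, fs, mains_hz=50.0, n_harmonics=3, q=10.0):
    """Cascade one notch per harmonic of the mains frequency (illustrative defaults)."""
    y = list(x)
    for k in range(1, n_harmonics + 1):
        f = k * mains_hz
        if f >= fs / 2.0:  # stop at the Nyquist frequency
            break
        b, a = notch_coefficients(f, fs, q)
        y = biquad(y, b, a)
    return y
```

A higher `q` narrows each notch and removes less of the neighbouring speech energy, at the cost of a longer settling transient at the start of the file. Use `mains_hz=60.0` for recordings made on 60 Hz mains.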
In practice, recordings often contain layered noise: a fixed HVAC background with intermittent vehicle pass-bys. The examiner processes in stages, removing the stationary component first, then addressing the residual non-stationary interference. Each stage is saved as a separate intermediate file so the chain of processing is auditable.
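One way to make a staged chain auditable is to log a cryptographic hash of the input and output of every stage alongside the parameters used. The sketch below is only an assumption about how such a log might look: the field names, the use of SHA-256, and the helper functions are hypothetical and are not prescribed by the guidance discussed in this article.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()


def record_stage(log, stage, params, input_bytes, output_bytes):
    """Append one processing stage to the log (a list of dicts)."""
    log.append({
        "stage": stage,
        "params": params,
        "input_sha256": sha256_hex(input_bytes),
        "output_sha256": sha256_hex(output_bytes),
    })
    return log


def chain_is_consistent(log, original_bytes):
    """True if stage 1 starts from the original and each stage starts from the previous output."""
    expected = sha256_hex(original_bytes)
    for entry in log:
        if entry["input_sha256"] != expected:
            return False
        expected = entry["output_sha256"]
    return True
```

With a log like this, anyone holding the original and the intermediate files can confirm that no undocumented step sits between two stages.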
Both methods subtract an estimate of the noise; they differ in how aggressively they do it.
Spectral subtraction, introduced by Boll in 1979, works in the frequency domain. For each short frame of audio, the magnitude spectrum of the noise estimate is subtracted from the magnitude spectrum of the noisy frame. The phase of the original frame is retained, and the result is inverse-transformed back to a time-domain signal. The noise estimate is typically derived from a few seconds of recording where no speech is present (silence before a conversation begins, a pause between sentences).
The main artefact is musical noise: residual tones scattered across the spectrum that sound like random beeping or a distant choir. They arise when the subtraction overshoots (subtracting more noise energy than was actually present in a particular frame) or when the noise estimate does not match the instantaneous noise level. Over-subtraction factors and spectral flooring (not letting the subtracted spectrum go below a small fraction of the noise estimate) reduce musical noise at the cost of leaving some residual background.
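The per-frame operation described above can be sketched as follows. This is a simplified illustration rather than a production implementation: it uses a naive DFT from the standard library so that it stays self-contained, it omits windowing and overlap-add, and the parameter names `over_sub` and `floor` (and their default values) are assumptions chosen for the example.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform; fine for short illustrative frames."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def estimate_noise_magnitude(noise_frames):
    """Average magnitude spectrum over frames known to contain no speech."""
    spectra = [[abs(c) for c in dft(f)] for f in noise_frames]
    return [sum(col) / len(spectra) for col in zip(*spectra)]


def subtract_magnitudes(spectrum, noise_mag, over_sub=2.0, floor=0.02):
    """Subtract the scaled noise magnitude per bin, keeping the original phase."""
    cleaned = []
    for x, n_mag in zip(spectrum, noise_mag):
        # Spectral floor: never go below a small fraction of the noise estimate.
        mag = max(abs(x) - over_sub * n_mag, floor * n_mag)
        cleaned.append(cmath.rect(mag, cmath.phase(x)))
    return cleaned


def spectral_subtract_frame(frame, noise_mag, over_sub=2.0, floor=0.02):
    """Frequency-domain subtraction for one frame, returned in the time domain."""
    return idft(subtract_magnitudes(dft(frame), noise_mag, over_sub, floor))
```

Raising `over_sub` suppresses more of the isolated spectral peaks that become musical noise, while raising `floor` masks what remains with a low bed of residual noise. Both settings belong in the processing notes.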
The Wiener filter takes a more principled approach. It computes, in each frequency bin, the gain that minimises the mean-squared error between the estimated clean signal and the true clean signal. The gain is driven by the local signal-to-noise ratio: in bins where speech dominates, the gain is close to 1 (pass most of the signal); in bins where noise dominates, the gain is close to 0 (attenuate heavily). The result is perceptually smoother but still depends critically on the quality of the noise power spectral density estimate.
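The SNR-driven gain can be written per bin as SNR / (1 + SNR). The sketch below estimates the SNR by power subtraction from the noisy and noise power spectra, which is one common simplification; the function name and the optional `gain_floor` parameter are assumptions for this example.

```python
def wiener_gains(noisy_psd, noise_psd, gain_floor=0.0):
    """Per-bin Wiener gain from noisy-signal and noise power spectral densities."""
    gains = []
    for p_y, p_n in zip(noisy_psd, noise_psd):
        if p_n <= 0.0:
            gains.append(1.0)  # no noise estimated in this bin: pass it through
            continue
        # Estimated SNR by power subtraction, clamped at zero.
        snr = max(p_y / p_n - 1.0, 0.0)
        gains.append(max(snr / (1.0 + snr), gain_floor))
    return gains


print(wiener_gains([100.0, 2.0, 1.0, 0.5], [1.0, 1.0, 1.0, 1.0]))
# prints [0.99, 0.5, 0.0, 0.0]
```

The first bin is dominated by speech and passes almost unchanged. The last two contain no more power than the noise estimate and are fully attenuated, unless a non-zero `gain_floor` is set to leave some natural-sounding background.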
When the noise changes faster than a static estimate can track it, the filter has to move with it.
Adaptive filtering is the standard toolkit for interference that varies over time. The general idea is that instead of estimating noise once from a silent interval, the filter's coefficients are updated continuously as the signal progresses, tracking the changing noise characteristics. The most widely used adaptive algorithm in audio forensics is the least-mean-squares (LMS) filter and its variants, which adjust their tap weights after each new sample to reduce the error between the filter's output and a desired signal.
Classical adaptive noise cancellation requires a reference signal: a microphone placed to capture mainly the interfering source (the road noise, the HVAC duct, the generator) with minimal speech capture. The adaptive filter models the path from that reference to the primary microphone and subtracts the estimated interference. This is the approach used in noise-cancelling headsets. In forensic casework a dedicated reference microphone is almost never available because the recording was not made for this purpose. Examiners must instead rely on time-frequency masking or, if a second channel is available, multi-channel analysis.
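For the uncommon case where a second channel does carry mainly the interference, the canceller structure can be sketched with a normalised LMS (NLMS) update. NLMS is one of the LMS variants, chosen here because its step size is easier to set; the function name, tap count, and step size `mu` are assumptions for this illustration.

```python
def nlms_cancel(primary, reference, taps=8, mu=0.1, eps=1e-8):
    """Adaptive noise cancellation with a normalised LMS filter.

    primary   : samples containing speech plus interference
    reference : samples containing mainly the interference
    Returns (error_signal, final_weights); the error signal is the enhanced output.
    """
    weights = [0.0] * taps
    buf = [0.0] * taps  # most recent reference samples, newest first
    enhanced = []
    for d, x in zip(primary, reference):
        buf = [x] + buf[:-1]
        # Filter output: current estimate of the interference in the primary channel.
        y = sum(w * b for w, b in zip(weights, buf))
        e = d - y
        # Update every tap after each sample, normalised by reference power.
        norm = sum(b * b for b in buf) + eps
        weights = [w + mu * e * b / norm for w, b in zip(weights, buf)]
        enhanced.append(e)
    return enhanced, weights
```

If the reference channel also picks up the speaker, the filter will cancel part of the speech along with the noise, so the degree of speech leakage should be checked before this structure is applied.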
A phone call stripped to 300-3400 Hz still contains most of the speech; what it loses matters.
Traditional analogue telephony and many digital codecs (G.711, GSM-FR) limit audio bandwidth to approximately 300-3400 Hz, dropping the low-frequency energy below 300 Hz and the high-frequency consonant energy above 3400 Hz. The missing high-frequency band degrades fricative and affricate discrimination: /s/ and /f/, /ch/ and /sh/ become harder to distinguish. Bandwidth extension attempts to regenerate the missing frequency content from correlates present in the narrowband signal.
Classical bandwidth extension uses the fact that voiced speech is approximately periodic, so its harmonics fall at integer multiples of the fundamental frequency. If a voiced segment has a fundamental of 120 Hz, the harmonics removed by the channel (120 Hz and 240 Hz below the band, and everything above 3400 Hz) are spaced in the same way as the harmonics at 360 Hz, 480 Hz, and beyond that survive in the 300-3400 Hz band. Techniques based on spectral folding and excitation estimation can reconstruct plausible high-frequency content. More recent machine-learning approaches train on pairs of narrowband and wideband recordings and learn to predict the missing spectral envelope.
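Spectral folding in its simplest form can be sketched as zero insertion: doubling the sample rate without an anti-imaging filter mirrors the narrowband spectrum around the old Nyquist frequency. The example below shows only that mirroring step. Practical systems usually fold an excitation signal and then shape it with an estimated spectral envelope, which is not shown here, and the function names are assumptions for this illustration.

```python
import cmath
import math


def spectral_fold(narrowband):
    """Double the sample rate by inserting a zero after every sample."""
    wide = []
    for s in narrowband:
        wide.extend((s, 0.0))
    return wide


def component_magnitude(signal, freq_hz, fs):
    """Magnitude of the single DFT component of signal at freq_hz."""
    return abs(sum(s * cmath.exp(-2j * math.pi * freq_hz * t / fs)
                   for t, s in enumerate(signal)))


# A 1 kHz tone sampled at 8 kHz, folded to 16 kHz.
fs_narrow = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs_narrow) for t in range(64)]
folded = spectral_fold(tone)
low = component_magnitude(folded, 1000, 2 * fs_narrow)   # original component
high = component_magnitude(folded, 7000, 2 * fs_narrow)  # mirror image at 8000 - 1000 Hz
```

Here `low` and `high` come out equal: the folded signal contains a 7 kHz image that was never in the recording. That image is synthesised content rather than recovered content, so it should be documented as such in the processing notes.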
Instrumental measures let an examiner benchmark enhancement without recruiting listeners.
Before objective measures existed, evaluating whether an enhancement actually improved intelligibility required listening panels, which are slow, costly, and variable. Two standardised instrumental metrics have largely replaced listening panels for benchmarking:
In practical forensic work, reference signals are rarely available. Examiners instead use informal A-B listening comparisons, documented transcription trials by trained listeners, and STOI estimates computed using noise-estimation-based surrogates for the reference. The goal is a documented, reproducible assessment of whether the enhancement improved the ability to understand speech.
Enhancement is only evidence if every step can be audited and challenged.
The SWGDE Best Practices for Forensic Audio (most recent version 2017) and equivalent guidance from the Audio Engineering Society (AES SC-03-12) require that the following be documented and disclosed for any enhanced recording submitted as evidence: