Audio Enhancement and Speech Intelligibility

Forensic audio enhancement recovers intelligible speech from degraded recordings using noise reduction, adaptive filtering, and bandwidth extension, while strict procedural rules protect the integrity of the original evidence.

Last updated: 19 Jun 2026

Forensic audio enhancement applies noise reduction, adaptive filtering, and bandwidth extension to degraded recordings in order to improve the intelligibility of speech that is already present in the signal. The three principal categories of degradation each require a distinct processing strategy: stationary noise (tape hiss, HVAC rumble), non-stationary noise (crowd noise, passing vehicles), and bandwidth limitation (telephone, codec-compressed audio). Enhancement does not create or reconstruct speech content; it can only reveal what the signal already contains. Every processing step must be documented, the original preserved unmodified, and the results disclosed in full, as required by SWGDE best-practice guidelines and ISO standards on audio evidence.

A police recording of a threat made over a cheap intercom, a body-worn camera capturing an altercation in a crowded bar, a wiretap interrupted by roadwork outside the window: all of these arrive in the forensic audio laboratory as recordings where the speech is there but not comfortably audible. The goal of forensic audio enhancement is to improve intelligibility without adding anything that was not in the original signal. That last clause is what separates forensic enhancement from broadcast audio production, and it is the hardest constraint to maintain under pressure.

The techniques involved split along the nature of the interference. Stationary noise, the kind that stays spectrally consistent throughout the recording (tape hiss, HVAC rumble, power-supply hum), yields to spectral subtraction and Wiener filtering. Non-stationary interference, noise that changes character from moment to moment (a vehicle driving past, a crowd that swells and quietens), requires adaptive filtering strategies that update their noise model continuously as the interference changes. Recordings degraded by bandwidth limitation (a phone call restricted to 300-3400 Hz, a walkie-talkie with a narrow audio path) pose a third challenge, addressed by bandwidth extension.

Underpinning all of it is an obligation: every processing step must be documented, the original must be preserved unmodified, and the expert must be prepared to explain, in plain language, exactly what was done and what the before-and-after comparison shows. The SWGDE guidelines and ISO standards on audio evidence exist precisely to make that obligation operational rather than aspirational.

By the end of this topic you will be able to:

Classify a noise source as stationary or non-stationary and select the appropriate processing method for each type.
Explain how spectral subtraction and Wiener filtering work, including the musical-noise artefact risk associated with spectral subtraction.
Describe the role of a reference channel in adaptive noise cancellation and identify substitutes available for single-channel casework recordings.
Interpret PESQ and STOI scores and state the conditions under which each measure can and cannot be applied to a forensic recording.
List the documentation and disclosure obligations imposed by SWGDE guidelines for any enhanced recording submitted as evidence.

Key terms

Stationary noise: Noise whose spectral characteristics remain approximately constant over the duration of a recording. Tape hiss, HVAC fan noise, and mains hum are typical examples. Stationary noise is tractable by spectral subtraction and Wiener filtering.
Non-stationary noise: Interference whose spectral content changes over time: a vehicle passing, crowd noise that rises and falls, or a door slamming. Adaptive filtering is required because a fixed noise estimate based on a single silent interval will not model the interference throughout the recording.
Spectral subtraction: An enhancement method that estimates the noise spectrum during a speech-free interval and subtracts it from every subsequent frame. Simple and effective for stationary noise, but prone to musical noise artefacts when the noise estimate is imperfect.
Wiener filter: A minimum mean-squared-error filter that computes a frequency-dependent gain function from the signal-to-noise ratio in each frequency band. Less aggressive than spectral subtraction, with fewer musical noise artefacts but similar dependence on a good noise estimate.
PESQ / STOI: Perceptual Evaluation of Speech Quality and Short-Time Objective Intelligibility: instrumental measures that predict perceived speech quality (PESQ) and intelligibility (STOI) without needing human subjects. STOI scores range from 0 to 1 and correlate well with the fraction of words correctly understood in noise.
SWGDE: Scientific Working Group for Digital Evidence: a US-based body that publishes best-practice guidelines for forensic audio, video, and image examination. Its audio enhancement guidelines require original preservation, processing disclosure, and working-copy labelling.

Classifying noise before choosing a method

The most consequential decision in audio enhancement is made before any processing begins: classifying the noise. A spectral subtraction algorithm fed a stationary noise estimate removes tape hiss reliably, but applied to a passing siren it leaves most of the interference in place and distorts the speech around it. The wrong method on the wrong noise type introduces processing artefacts that reduce intelligibility rather than improving it.
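One rough way to support a listening-based classification is to measure how much the band levels of a noise-only passage drift over time. The sketch below is a heuristic under stated assumptions, not a validated classifier: the function names, the four-band split, and any threshold applied to the score are illustrative choices that do not come from a standard.

```python
import numpy as np

def frame_power_spectra(x, frame_len=512, hop=256):
    """Return power spectra of overlapping Hann-windowed frames, shape (n_frames, n_bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def stationarity_score(x, frame_len=512, hop=256, n_bands=4):
    """Mean over bands of the standard deviation (in dB) of band level across frames.

    Lower values suggest a more stationary signal. The band count is an assumption.
    """
    power = frame_power_spectra(x, frame_len, hop)
    bands = np.array_split(power, n_bands, axis=1)
    band_db = np.stack([10 * np.log10(b.sum(axis=1) + 1e-12) for b in bands], axis=1)
    return float(np.mean(np.std(band_db, axis=0)))

# Synthetic stand-ins: constant hiss versus noise that swells and fades.
fs = 8000
rng = np.random.default_rng(0)
t = np.arange(4 * fs) / fs
hiss = rng.standard_normal(len(t))
swell = rng.standard_normal(len(t)) * (0.1 + 0.9 * (0.5 - 0.5 * np.cos(2 * np.pi * 0.5 * t)))

print("hiss score (dB): ", stationarity_score(hiss))
print("swell score (dB):", stationarity_score(swell))
```

A low score on a noise-only passage is consistent with stationary noise; a high score suggests that a single fixed noise estimate will not describe the interference. The examiner's ear and the spectrogram remain the primary evidence for the classification.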

Noise type	Spectral character	Recommended approach
Tape hiss / analogue noise floor	Broadband, spectrally flat, stationary	Spectral subtraction or Wiener filter
HVAC or fan rumble	Low-frequency, stationary	High-pass filter, then Wiener filter
Mains hum (50/60 Hz)	Narrowband, stationary at harmonics	Notch filter at harmonic frequencies
Vehicle pass-by	Non-stationary, sweeping spectral peak	Adaptive filtering or multi-band tracking
Crowd / babble noise	Non-stationary, broadband	Adaptive noise cancellation if reference available
Wind on microphone	Low-frequency burst, non-stationary	Adaptive filter + high-pass gating
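The mains-hum row of the table can be sketched with SciPy's `iirnotch` and `filtfilt`. This is a minimal illustration, assuming 50 Hz mains, three harmonics, and a quality factor of 30; the function name `remove_hum` and all parameter values are hypothetical and would need to be chosen and logged per case.

```python
import numpy as np
from scipy.signal import filtfilt, iirnotch

def remove_hum(x, fs, mains_hz=50.0, n_harmonics=3, q=30.0):
    """Apply a zero-phase notch at the mains frequency and its first harmonics."""
    y = np.asarray(x, dtype=float)
    for k in range(1, n_harmonics + 1):
        f0 = k * mains_hz
        if f0 >= fs / 2:                 # skip harmonics at or above Nyquist
            break
        b, a = iirnotch(f0, q, fs=fs)    # narrow band-stop centred on f0
        y = filtfilt(b, a, y)            # forward-backward filtering: no phase shift
    return y

# Synthetic stand-in: a 1 kHz tone (in place of speech) plus hum at 50, 100 and 150 Hz.
fs = 8000
t = np.arange(4 * fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
hum = (np.sin(2 * np.pi * 50 * t)
       + 0.5 * np.sin(2 * np.pi * 100 * t)
       + 0.25 * np.sin(2 * np.pi * 150 * t))
cleaned = remove_hum(tone + hum, fs)
```

A narrow notch rings for a fraction of a second at the start and end of the file, so the first and last moments of the output deserve a separate listen.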

In practice, recordings often contain layered noise: a fixed HVAC background with intermittent vehicle pass-bys. The examiner processes in stages, removing the stationary component first, then addressing the residual non-stationary interference. Each stage is saved as a separate intermediate file so the chain of processing is auditable.

Noise classification decision tree: stationary sources (constant spectrum) route to spectral subtraction or Wiener filtering; non-stationary sources (time-varying spectrum) route to adaptive filtering or time-frequency masking; layered recordings are processed in stages, stationary first.

Spectral subtraction and Wiener filtering

Spectral subtraction, introduced by Boll in 1979, works in the frequency domain. For each short frame of audio, the magnitude spectrum of the noise estimate is subtracted from the magnitude spectrum of the noisy frame. The phase of the original frame is retained, and the result is inverse-transformed back to a time-domain signal. The noise estimate is typically derived from a few seconds of recording where no speech is present (silence before a conversation begins, a pause between sentences).

The main artefact is musical noise: residual tones scattered across the spectrum that sound like random beeping or a distant choir. They arise when the subtraction overshoots (subtracting more noise energy than was actually present in a particular frame) or when the noise estimate does not match the instantaneous noise level. Over-subtraction factors and spectral flooring (not letting the subtracted spectrum go below a small fraction of the noise estimate) reduce musical noise at the cost of leaving some residual background.
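The steps described above (noise estimate from a speech-free interval, magnitude subtraction with an over-subtraction factor, a spectral floor, and reuse of the noisy phase) can be sketched in NumPy. This is a teaching sketch, not a casework tool: the frame size, the names `alpha` and `beta`, and their default values are assumptions, and the test signal is a synthetic tone standing in for speech.

```python
import numpy as np

FRAME, HOP = 512, 256
WINDOW = np.hanning(FRAME + 1)[:-1]   # periodic Hann: overlapped copies sum to 1 at 50% hop

def stft(x):
    """Windowed frames -> complex spectra, shape (n_frames, FRAME // 2 + 1)."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * WINDOW for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length):
    """Overlap-add resynthesis; samples past the last full frame stay zero."""
    out = np.zeros(length)
    for i, frame in enumerate(np.fft.irfft(spec, n=FRAME, axis=1)):
        out[i * HOP:i * HOP + FRAME] += frame
    return out

def spectral_subtract(noisy, noise_only, alpha=1.5, beta=0.02):
    """alpha: over-subtraction factor; beta: floor as a fraction of the noise estimate."""
    noise_mag = np.abs(stft(noise_only)).mean(axis=0)    # average noise magnitude per bin
    spec = stft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    cleaned = np.maximum(mag - alpha * noise_mag, beta * noise_mag)
    return istft(cleaned * np.exp(1j * phase), len(noisy))  # noisy phase is retained

# Synthetic demo: one second of noise alone, then a tone buried in the same noise.
fs = 8000
rng = np.random.default_rng(0)
t = np.arange(3 * fs) / fs
clean = 0.5 * np.sin(2 * np.pi * 1000 * t) * (t >= 1.0)
noisy = clean + 0.1 * rng.standard_normal(len(t))
enhanced = spectral_subtract(noisy, noisy[:fs])
```

Raising `alpha` removes more background but increases the risk of clipping weak speech components; raising `beta` masks musical noise by leaving more of the original noise floor. Both values belong in the processing log.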

The Wiener filter takes a more principled approach. It computes, in each frequency bin, the gain that minimises the expected mean-squared error between the estimated signal and the underlying clean speech. The gain is driven by the local signal-to-noise ratio: in bins where speech dominates, the gain is close to 1 (pass most of the signal); in bins where noise dominates, the gain is close to 0 (attenuate heavily). The result is perceptually smoother but still depends critically on the quality of the noise power spectral density estimate.
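The gain rule can be written in a few lines. The sketch below assumes the simplest SNR estimate (noisy power divided by noise power, minus one, clipped at zero) and adds a small gain floor; the function name `wiener_gain` and the floor value are illustrative assumptions, and practical implementations usually smooth the SNR estimate over time.

```python
import numpy as np

def wiener_gain(noisy_power, noise_psd, gain_floor=0.05):
    """Per-bin Wiener gain G = SNR / (1 + SNR), limited below by gain_floor."""
    # Estimated speech-to-noise power ratio in each bin, never negative.
    snr = np.maximum(noisy_power / noise_psd - 1.0, 0.0)
    gain = snr / (1.0 + snr)
    return np.maximum(gain, gain_floor)

# Three bins: speech-dominated, equal speech and noise, noise only.
noisy_power = np.array([101.0, 2.0, 1.0])
noise_psd = np.ones(3)
gains = wiener_gain(noisy_power, noise_psd)
print(gains)   # approximately 0.99, 0.5 and 0.05
```

The enhanced spectrum is the noisy spectrum multiplied by this gain, bin by bin, so an error in `noise_psd` translates directly into too much or too little attenuation.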

Adaptive filtering for non-stationary interference

Adaptive filtering addresses interference that varies over time. Rather than deriving a fixed noise estimate from a single silent interval, the filter updates its coefficients continuously as the signal progresses, tracking changing noise characteristics. The most widely used adaptive algorithm in audio forensics is the least-mean-squares (LMS) filter and its variants, which adjust their tap weights after each new sample to reduce the error between the filter's output and a desired signal.

A reference signal is required: typically a second microphone placed to capture mainly the interfering source and as little of the speech as possible. The adaptive filter models the path from that reference to the primary microphone and subtracts the estimated interference, the same principle used in noise-cancelling headsets. In forensic casework a dedicated reference microphone is almost never available, because the recording was not made for this purpose. Examiners must instead rely on single-channel time-frequency masking or, where the recording has a second channel, on multi-channel analysis.

Adaptive noise cancellation: reference mic feeds the filter to suppress interference on the primary channel.
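The two-channel arrangement can be sketched with a normalised LMS (NLMS) filter, one of the LMS variants. Everything in the demo is an assumption made for illustration: the function name `nlms_cancel`, the tap count, the step size `mu`, and the synthetic three-tap path from the reference to the primary microphone.

```python
import numpy as np

def nlms_cancel(primary, reference, n_taps=32, mu=0.1, eps=1e-8):
    """Subtract the part of `primary` that can be predicted from `reference`."""
    w = np.zeros(n_taps)                       # adaptive tap weights
    out = np.array(primary, dtype=float)       # first n_taps - 1 samples pass through
    for n in range(n_taps - 1, len(primary)):
        x = reference[n - n_taps + 1:n + 1][::-1]   # newest reference sample first
        estimate = w @ x                            # predicted interference
        error = primary[n] - estimate               # what remains is the output
        w += mu * error * x / (x @ x + eps)         # normalised LMS weight update
        out[n] = error
    return out

# Synthetic demo: a tone (in place of speech) plus filtered noise on the primary channel.
fs = 8000
rng = np.random.default_rng(0)
t = np.arange(2 * fs) / fs
speech_like = 0.2 * np.sin(2 * np.pi * 300 * t)
reference = rng.standard_normal(len(t))               # reference mic: interference only
path = np.array([0.6, -0.3, 0.1])                      # hypothetical acoustic path
interference = np.convolve(reference, path)[:len(t)]
primary = speech_like + interference
cleaned = nlms_cancel(primary, reference)
```

The sketch works because the reference contains none of the speech. If speech leaks into the reference channel, the filter will cancel part of the speech as well, which is one reason a second channel in casework cannot be assumed to be a usable reference.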

Bandwidth extension for telephone and codec-limited recordings

Traditional analogue telephony and many digital codecs (G.711, GSM-FR) limit audio bandwidth to approximately 300-3400 Hz, dropping the low-frequency energy below 300 Hz and the high-frequency consonant energy above 3400 Hz. The missing high-frequency band degrades fricative and affricate discrimination: /s/ and /f/, /ch/ and /sh/ become harder to distinguish. Bandwidth extension attempts to regenerate the missing frequency content from correlates present in the narrowband signal.

Classical bandwidth extension exploits the approximate periodicity of voiced speech: if a voiced segment has a fundamental frequency of 120 Hz, its harmonics fall at integer multiples of 120 Hz (240 Hz, 360 Hz, and so on), so the harmonics removed below 300 Hz and above 3400 Hz lie on the same series as those still present in the 300-3400 Hz band. Spectral folding and excitation estimation can reconstruct plausible high-frequency content. Machine-learning approaches train on narrowband/wideband recording pairs and learn to predict the missing spectral envelope.
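Spectral folding in its most basic form can be illustrated as follows: inserting a zero between every pair of samples doubles the sample rate and mirrors the existing spectrum into the new upper band. The sketch assumes an 8 kHz input; the function name `spectral_fold` and the `image_gain` value are hypothetical, and real systems shape the folded band with an estimated spectral envelope rather than a single gain.

```python
import numpy as np

def spectral_fold(narrow, image_gain=0.1):
    """Double the sample rate and keep an attenuated mirror image as the new upper band."""
    up = np.zeros(2 * len(narrow))
    up[::2] = narrow                       # zero insertion mirrors the spectrum about the old Nyquist
    spec = np.fft.rfft(up)
    half = len(spec) // 2                  # bin at the old Nyquist frequency
    spec[:half] *= 2.0                     # restore the baseband level halved by zero insertion
    spec[half:] *= 2.0 * image_gain        # synthesised upper band, strongly attenuated
    return np.fft.irfft(spec, n=len(up))

# A 1 kHz tone sampled at 8 kHz gains a mirror component at 8000 - 1000 = 7000 Hz.
fs = 8000
t = np.arange(fs) / fs
wide = spectral_fold(np.sin(2 * np.pi * 1000 * t))
amplitude = np.abs(np.fft.rfft(wide)) / (len(wide) / 2)   # 1 Hz per bin at 16 kHz, 1 s
print(amplitude[1000], amplitude[7000])   # close to 1.0 and 0.1
```

Everything above the old Nyquist frequency in the output is generated from the lower band; none of it was recorded. An examiner using any bandwidth-extension step would need to state that plainly in the processing log.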

Measuring intelligibility: PESQ and STOI

Before objective measures existed, evaluating enhancement quality required listening panels, which are slow, costly, and variable. Two standardised instrumental metrics have largely replaced listening panels for benchmarking:

PESQ (ITU-T P.862): Perceptual Evaluation of Speech Quality, originally designed to assess telephone network degradation. It compares a degraded signal against a clean reference and produces a raw score between -0.5 and 4.5, usually reported as a Mean Opinion Score equivalent on a scale of roughly 1 to 4.5. PESQ requires a clean reference recording of the same utterance, which is rarely available in casework, so its forensic application is mainly for validating processing chains in controlled tests.
STOI (Short-Time Objective Intelligibility): computes the correlation between short-time temporal envelopes of the clean and processed signals in one-third-octave bands. Scores range from 0 to 1 and correlate well with the proportion of words correctly identified by listeners. STOI also requires a clean reference, but it is designed to track changes in intelligibility, whereas PESQ tracks perceived quality.
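The idea behind STOI, correlating band envelopes of a clean and a processed signal over short segments, can be shown with a simplified measure. The code below is not STOI itself: it omits the resampling, silent-frame removal, and clipping steps of the published algorithm, and the function names, band layout, and segment length are assumptions chosen for illustration.

```python
import numpy as np

def band_envelopes(x, fs, frame=256, hop=128, n_bands=15, f_min=150.0):
    """Short-time magnitude envelopes in one-third-octave bands, shape (n_bands, n_frames)."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    centres = f_min * 2.0 ** (np.arange(n_bands) / 3.0)
    envelopes = []
    for fc in centres:
        in_band = (freqs >= fc * 2 ** (-1 / 6)) & (freqs < fc * 2 ** (1 / 6))
        envelopes.append(np.sqrt(power[:, in_band].sum(axis=1)))
    return np.array(envelopes)

def envelope_correlation(clean, processed, fs, seg_frames=30):
    """Mean correlation of band envelopes over sliding segments (simplified, not STOI)."""
    a = band_envelopes(clean, fs)
    b = band_envelopes(processed, fs)
    scores = []
    for end in range(seg_frames, a.shape[1] + 1):
        sa = a[:, end - seg_frames:end]
        sb = b[:, end - seg_frames:end]
        sa = sa - sa.mean(axis=1, keepdims=True)
        sb = sb - sb.mean(axis=1, keepdims=True)
        num = (sa * sb).sum(axis=1)
        den = np.sqrt((sa ** 2).sum(axis=1) * (sb ** 2).sum(axis=1)) + 1e-12
        scores.append(num / den)
    return float(np.mean(scores))

# Synthetic demo: modulated noise as a speech stand-in, with light and heavy added noise.
fs = 8000
rng = np.random.default_rng(0)
t = np.arange(2 * fs) / fs
clean = rng.standard_normal(len(t)) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))
light = clean + 0.05 * rng.standard_normal(len(t))
heavy = clean + 3.0 * rng.standard_normal(len(t))
print(envelope_correlation(clean, light, fs), envelope_correlation(clean, heavy, fs))
```

Both arguments must be time-aligned versions of the same utterance, which is exactly the clean reference that casework recordings lack.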

In practical forensic work, reference signals are rarely available. Examiners instead use informal A-B listening comparisons, documented transcription trials by trained listeners, and STOI estimates computed using noise-estimation-based surrogates for the reference. The goal is a documented, reproducible assessment of whether the enhancement improved the ability to understand speech.

Disclosure obligations and SWGDE guidelines

The SWGDE Best Practices for Forensic Audio (most recent version 2.5, published 2022) and equivalent guidance from the Audio Engineering Society (AES SC-03-12) require that the following be documented and disclosed for any enhanced recording submitted as evidence:

Original preservation: a bit-for-bit copy of the original recording must be made and verified against its hash before any processing begins. All subsequent work is on working copies.
Processing log: every processing step is documented in order: the tool used, version, parameters, and the purpose of each step. The log must be complete enough that another qualified examiner could reproduce the result.
Working-copy labelling: the enhanced file is clearly labelled as a working copy, not a substitute for the original. Both are submitted to court.
AGC caution: automatic gain control should be applied, if at all, only as a final step. Premature AGC amplifies background noise during speech gaps, which can obscure quiet words that were present.
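The preservation and logging obligations lend themselves to a small script. The following is a sketch using only the Python standard library; the function names and the log field names are hypothetical and are not prescribed by SWGDE, so a laboratory would substitute its own template.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large recordings need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def make_working_copy(original_path, working_path):
    """Copy the original and confirm the copy is bit-for-bit identical via its hash."""
    original_hash = sha256_of_file(original_path)
    shutil.copyfile(original_path, working_path)
    if sha256_of_file(working_path) != original_hash:
        raise ValueError("working copy does not match the original")
    return original_hash

def log_step(log, tool, version, parameters, purpose, output_path):
    """Append one processing step, in order, with the hash of the file it produced."""
    log.append({
        "step": len(log) + 1,
        "utc_time": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "version": version,
        "parameters": parameters,
        "purpose": purpose,
        "output_sha256": sha256_of_file(output_path),
    })
    return log

def save_log(log, path):
    """Write the processing log as readable JSON for disclosure."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(log, handle, indent=2)
```

Recording the hash of each intermediate file ties every log entry to a specific output, which helps a second examiner confirm that a reproduced step yields the same result.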

Worked example

Enhancing a CCTV audio recording of an extortion threat

Layered noise, a cheap microphone, and a compressed codec: a realistic starting point.

A CCTV system in a retail store records a confrontation in which an employee alleges a customer made an extortion threat. The recording is from a low-cost IP camera, with audio sampled at 8 kHz and encoded with a G.711 codec, and the shop was playing background music at the time. The audio submitted to the laboratory is 90 seconds long.

Preservation. Original file hashed (SHA-256 documented). All subsequent work on a working copy.
Noise classification. Listening identifies: (a) a stationary mid-frequency noise floor from the camera's internal fan, (b) non-stationary background music at approximately 65 dB SPL. The music is the primary intelligibility problem.
Step 1: stationary noise. A 3-second silence before the encounter yields a noise reference. Spectral subtraction removes the fan noise floor. Musical noise artefacts are moderate; over-subtraction factor set to 1.5 to reduce them.
Step 2: music interference. No reference channel is available. A time-frequency masking approach (non-negative matrix factorisation) is applied to separate the music component from the speech component. Results are partial: the music is attenuated by approximately 10 dB in the speech-dominated frames, leaving residual harmonic artefacts from the background track.
Step 3: intelligibility check. A trained listener transcribes the enhanced version and marks uncertain words. The key disputed phrase at second 47 yields two plausible candidate transcriptions. The examiner reports both, notes the ambiguity, and submits the original and the enhanced working copy. No bandwidth extension is applied because the intelligibility uncertainty is not frequency-bandwidth-related.

Realistic forensic enhancement rarely reaches perfect intelligibility. The examiner's contribution is applying every legitimate technique, documenting each step, and presenting the result honestly together with its limits.

Key Takeaways

Noise must be classified as stationary or non-stationary before choosing a method: spectral subtraction and Wiener filtering suit stationary sources; adaptive filtering suits time-varying interference.
Spectral subtraction can produce musical noise artefacts; over-subtraction factors and spectral flooring reduce them, but the examiner must document the parameters chosen.
Adaptive noise cancellation requires a reference channel that is rarely available in casework; time-frequency masking methods offer a partial substitute for single-channel recordings.
STOI scores give an instrumental intelligibility benchmark, useful for comparing processing chains without always needing listener trials, though a clean reference signal is normally required.
SWGDE guidelines require original preservation, a complete processing log, working-copy labelling, and honest reporting of intelligibility limits, including words that remain uncertain after best-effort enhancement.

What is spectral subtraction in audio enhancement?

Spectral subtraction estimates the spectrum of background noise during a silent passage and subtracts that noise estimate from each subsequent frame. It works well for stationary noise sources such as tape hiss or HVAC rumble, but can introduce musical noise artefacts when the noise estimate is imperfect.

How does Wiener filtering differ from spectral subtraction?

Wiener filtering uses the estimated noise power spectrum to compute a frequency-dependent gain that minimises the mean-squared error between the enhanced signal and the original speech. It typically produces less musical noise than spectral subtraction but still requires a reliable noise estimate and performs poorly on non-stationary interference.

What is SWGDE and why do its guidelines matter for audio enhancement?

SWGDE, the Scientific Working Group for Digital Evidence, publishes best-practice guidance for forensic audio examination in the United States. Its audio enhancement guidelines require that the original recording be preserved unmodified, that all processing steps be documented and disclosed, and that the enhanced output be presented as a working copy rather than a replacement for the original.

What do PESQ and STOI measure?

PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility) are instrumental measures that predict, without listener trials, how processed speech will be judged by human listeners: PESQ predicts perceived quality and STOI predicts intelligibility. STOI specifically correlates well with the proportion of words correctly identified in noise, making it useful for benchmarking enhancement steps.

Can audio enhancement create or fabricate speech that was not in the original?

No legitimate forensic enhancement adds speech content that was not present in the original recording. Processing can only expose what is already there. If a word is truly masked beyond recovery, no filter can reconstruct it. This is why enhanced versions are always presented alongside the original and why disclosure of all processing steps is mandatory.
