Video and Audio File Structures: Containers, Codecs, and Compression

Video and audio files are two-layer structures: a container that organises tracks and timing, and a codec that compresses the actual media data. The interaction between these layers creates characteristic artefacts that forensic examiners use to detect re-encoding, reconstruct timelines, and identify source devices.

Last updated: 19 Jun 2026

Video and audio files are built as two independent layers: a container format (MP4, MKV, MOV, AVI, MXF) that organises tracks, timing, and metadata, and a codec (H.264, HEVC, AAC, MP3) that compresses the actual media data inside those tracks. This separation matters forensically because rewrapping a file, moving its bitstream to a different container, leaves the codec data unchanged, while re-encoding decodes and recompresses it, introducing new generation loss and potentially masking the encoder fingerprint of the original recording. Inter-frame compression, in which most video frames store only differences from neighbouring frames rather than complete images, means that irregularities in the spacing of self-contained I-frames are a primary structural indicator of editing. Lossy audio codecs leave permanent spectral signatures, including frequency ceilings and masking holes, that persist even when the file is subsequently converted to an uncompressed format.

Every video file submitted as evidence has a nested structure: an outer container that organises multiple tracks and encodes timing, and inner codec streams that compress the actual image and sound data. The distinction between these layers is directly operational: it determines whether a forensic examiner can distinguish a re-encoded video, which may have been manipulated, from one that was only rewrapped and is forensically equivalent to the original, and it is what makes I-frame irregularities a structural marker for editing.

Inter-frame compression is the other key idea. Video codecs like H.264 do not store each frame as a complete image. Most frames store only the differences from their neighbours, so decoding a single frame requires knowledge of the frames around it. This Group of Pictures (GOP) structure is efficient for playback but critical for forensic analysis: tampering with one frame can corrupt many around it, and the pattern of frame types is an encoder fingerprint.

Audio forensics adds another layer. Lossless formats like WAV preserve the actual sample values; lossy formats like MP3 and AAC apply psychoacoustic models that discard information permanently. Each codec leaves characteristic compression artefacts: a recording that claims to be original but shows the spectral holes of MP3 encoding carries evidence of its encoding history regardless of its current file format.

By the end of this topic you will be able to:

Distinguish container formats from codecs and explain why the same codec can appear in multiple containers without affecting the underlying bitstream.
Describe the Group of Pictures (GOP) structure in H.264/HEVC and explain why irregular I-frame spacing is a forensic indicator of editing or re-encoding.
Differentiate rewrapping from re-encoding and identify the evidentiary implications of each for encoder fingerprint preservation.
Identify the spectral artefacts left by lossy audio codecs (MP3, AAC) and explain how they can reveal a recording's encoding history even in files saved as uncompressed PCM.
Read a video file's container structure (ftyp brand, moov/mdat position, bitrate profile) to assess consistency with a claimed recording device.

Key terms

Container format: A file format (MP4, MKV, MOV, AVI, MXF) that multiplexes video, audio, and metadata tracks into a single file with a defined byte structure. The container handles synchronisation and seeking; it does not define how the media is compressed.
Codec: A coder/decoder algorithm that compresses and decompresses media data. H.264/AVC and H.265/HEVC are the dominant video codecs; AAC and MP3 are dominant for audio. The codec determines compression efficiency, quality loss, and the nature of artefacts in the output.
I-frame: An intra-coded frame: a self-contained image compressed without reference to any other frame, like a JPEG. I-frames are the only random-access entry points in a video stream.
GOP (Group of Pictures): The repeating sequence of I, P, and B frames in an inter-frame compressed video. GOP size (the distance between I-frames) is an encoder setting that affects file size, seek latency, and forensic detectability of editing.
Rewrapping: Moving a compressed bitstream from one container to another without decoding it. Forensically equivalent to the original encoded stream; no new generation loss is introduced.
Perceptual coding: Lossy compression that exploits psychoacoustic or psychovisual masking to discard information the human senses are least sensitive to. MP3, AAC, and AC-3 audio codecs all use this approach, producing spectral artefacts that persist through subsequent conversions.

Container formats: the outer packaging

A container format defines a byte-level specification for how multiple streams of data (video, audio, subtitles, chapters, metadata) are interleaved and indexed within a single file. The container handles the administrative structure: which byte range contains which track, at what time offset, and in what order samples should be presented. It is analogous to a ZIP archive that carries separate files but also encodes the relationship between them.

Container	Extension	Governing spec	Common use context
MP4 (ISO Base Media File Format family)	.mp4 / .m4v / .m4a	ISO/IEC 14496-12 (base format), ISO/IEC 14496-14 (MP4)	Phones, tablets, streaming, web; most common container in current casework
Matroska	.mkv / .mka / .webm	Matroska spec (IETF RFC 9559)	Open-source community, long-term archiving, flexible codec support
QuickTime	.mov	Apple QTFF spec	Apple devices; structurally similar to MP4, shares atom/box hierarchy
AVI	.avi	Microsoft RIFF/AVI spec	Legacy Windows recordings, older CCTV systems
MXF	.mxf	SMPTE 377-1	Broadcast, professional production, digital cinema

For a forensic examiner, the container structure is the first place to look. MP4 and MOV files are built from a hierarchy of typed units called atoms (QuickTime) or boxes (ISO BMFF). Key boxes include moov (the index of all track metadata and sample locations), mdat (the actual media data), and ftyp (the brand identifier that specifies which variant of the spec the file targets). The position of moov relative to mdat in the file reveals whether the file was optimised for streaming (moov first) or captured in a single pass (moov last), which is a useful indicator of the recording workflow.
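As a rough illustration of that first look, the sketch below walks the top-level boxes of an ISO BMFF file and reports the ftyp major brand and whether moov precedes mdat. It relies only on the generic box header layout (32-bit big-endian size, four-character type, optional 64-bit size). The function names are illustrative, not taken from any particular tool, and the sketch assumes the ftyp box uses the ordinary 8-byte header.

```python
import io
import struct
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_top_level_boxes(stream: BinaryIO) -> Iterator[Tuple[str, int, int]]:
    """Yield (box_type, offset, size) for each top-level box in the file."""
    offset = 0
    while True:
        stream.seek(offset)
        header = stream.read(8)
        if len(header) < 8:
            return  # end of file (or trailing bytes too short to be a box)
        size, raw_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the type field.
            ext = stream.read(8)
            if len(ext) < 8:
                return
            size = struct.unpack(">Q", ext)[0]
            header_len = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the file.
            size = stream.seek(0, 2) - offset
        if size < header_len:
            return  # malformed box; stop rather than loop forever
        yield raw_type.decode("latin-1"), offset, size
        offset += size


def describe_layout(stream: BinaryIO) -> Dict[str, object]:
    """Summarise major brand, box order, and moov/mdat ordering."""
    boxes = list(iter_top_level_boxes(stream))
    order = [box_type for box_type, _, _ in boxes]
    brand = None
    for box_type, offset, size in boxes:
        if box_type == "ftyp" and size >= 12:
            stream.seek(offset + 8)  # assumes the 8-byte header form
            brand = stream.read(4).decode("latin-1")
            break
    moov_first = None
    if "moov" in order and "mdat" in order:
        moov_first = order.index("moov") < order.index("mdat")
    return {"major_brand": brand, "order": order, "moov_first": moov_first}
```

On a real exhibit the function would be called on a read-only handle, for example `describe_layout(open(path, "rb"))`, working on a verified copy rather than the original. A result with `moov_first` set to False corresponds to the "moov last" layout described above.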

Video codecs and inter-frame compression

H.264 (Advanced Video Coding, AVC) is the most widely deployed video codec in forensic casework. It compresses video using a combination of intra-frame compression (applied within a single frame) and inter-frame compression (exploiting similarities between adjacent frames). Understanding both is necessary to assess what information survives compression and what does not.

Intra-frame compression in H.264 divides each frame into macroblocks (16×16 pixel regions) and predicts each macroblock from its already-coded neighbours within the same frame, then encodes the prediction residual with a transform (a 4×4 or 8×8 integer approximation of the DCT), quantisation, and entropy coding. Inter-frame compression predicts a macroblock from a nearby block in a reference frame, encoding only the motion vector and the residual.

H.264 GOP structure: I, P, B frame sequence and dependency chain.

H.265 (HEVC) extends H.264's approach with larger coding units (up to 64×64), improved intra prediction modes, and better support for parallel processing. For forensics, HEVC produces fewer blockiness artefacts at equivalent bitrates, which can make some detectors that rely on visible compression artefacts less reliable at high compression. VP9 and AV1, used in web streaming, follow similar inter-frame principles and were designed to be royalty-free.

MPEG-2, though older (1995), remains common in broadcast footage, standard-definition CCTV, and DVD-sourced material. Its macroblock structure (16×16 blocks, no recursion) produces a more visible blocky appearance at high compression, which can be useful for estimating the original recording bitrate or detecting resampling.

Audio containers and compression artefacts

Audio evidence appears most commonly in four formats: WAV (PCM), FLAC, MP3, and AAC. The distinction between lossless and lossy matters more acutely for audio than for video because forensic audio tasks often depend on fine spectral features. Speaker identification, gunshot detection, and background noise analysis all require the original frequency content, not a perceptually filtered approximation.

Format	Compression type	Typical use	Forensic implication
WAV (PCM)	None (uncompressed)	Studio recording, phone call logs, court-ordered intercepts	Gold standard; all spectral information intact
FLAC	Lossless	Archival, audiophile; also some CCTV systems	Bit-exact reconstruction; metadata includes encoding software
MP3 (MPEG-1 Layer III)	Lossy, perceptual coding	Consumer music, voice memos, older phones	Pre-echo, temporal masking holes, frequency cutoff at bitrate-dependent ceiling
AAC (Advanced Audio Coding)	Lossy, perceptual coding	Modern phones, streaming, iOS recordings	Similar artefacts to MP3 but better at low bitrates; HE-AAC profiles are used at very low bitrates

MP3 encoding splits the signal into sub-bands with a polyphase filterbank followed by a modified discrete cosine transform (MDCT), models the audibility of each frequency component against the current masking threshold, and discards or coarsely quantises components below the threshold. Encoders also typically apply a bitrate-dependent low-pass filter, which produces a spectral ceiling: at 128 kbps, frequency content above roughly 16 kHz is usually discarded. A recording that claims to be uncompressed PCM but shows a noise floor up to about 16 kHz and then almost no energy above it is likely to have passed through a lossy encoder such as MP3 at some point, regardless of what its file header says.
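The spectral-ceiling check can be sketched in a few lines. The example below uses only the Python standard library: it applies a Hann window to one block of samples, computes a naive DFT, and reports the highest frequency whose magnitude lies within a chosen margin of the spectral peak. The function names and the -60 dB margin are illustrative assumptions. Real casework would use an FFT library, average over many blocks, and inspect the spectrogram rather than trust a single number.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum_db(block: Sequence[float]) -> List[float]:
    """Return the magnitude spectrum of one block in dB relative to its peak."""
    n = len(block)
    # Periodic Hann window to limit spectral leakage.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(block)]
    mags = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to Nyquist
        step = -2j * math.pi * k / n
        acc = 0j
        for i, s in enumerate(windowed):
            acc += s * cmath.exp(step * i)
        mags.append(abs(acc))
    peak = max(mags) or 1.0
    return [20 * math.log10(max(m / peak, 1e-12)) for m in mags]


def estimate_cutoff_hz(block: Sequence[float], sample_rate: int,
                       floor_db: float = -60.0) -> float:
    """Highest frequency whose level is above floor_db relative to the peak."""
    spectrum = magnitude_spectrum_db(block)
    bin_hz = sample_rate / len(block)
    for k in range(len(spectrum) - 1, -1, -1):
        if spectrum[k] > floor_db:
            return k * bin_hz
    return 0.0


def tone_mix(freqs_and_amps, sample_rate: int, n: int) -> List[float]:
    """Synthetic test signal: a sum of sine tones (illustration only)."""
    return [sum(a * math.sin(2 * math.pi * f * i / sample_rate)
                for f, a in freqs_and_amps)
            for i in range(n)]
```

With a 48 kHz synthetic signal containing tones at 1, 15, and 20 kHz, the estimate lands near 20 kHz; removing the 20 kHz tone, as a low-pass filter would, pulls it down to about 15 kHz. A ceiling well below the Nyquist frequency of a file presented as original PCM is a prompt for closer examination, not proof on its own, since band-limited microphones and resampling can also produce one.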

Re-encoding versus rewrapping: the forensic distinction

Rewrapping moves compressed bitstream data from one container to another without touching the codec layer. A Matroska MKV file containing an H.264 stream rewrapped into an MP4 container produces a file whose video data is bit-for-bit identical to the original. The codec artefacts, PRNU-equivalent noise patterns, and encoder fingerprints are all preserved. This operation is common in legitimate workflows: a video editor may rewrap for compatibility without re-encoding the content.

Re-encoding is different. It decodes the compressed bitstream back to raw frames (or samples), then re-applies the codec at some quality setting. Every re-encoding round introduces new quantisation error and potentially new artefacts. The original encoder fingerprint is diluted or replaced by that of the new encoder. If a manipulated region was introduced before the file was re-encoded, the entire video may show fresh artefacts that obscure what was there before. Detecting whether a video has been re-encoded at least once is therefore one of the first questions a forensic video examiner asks.

Re-encoding vs rewrapping: the difference in what changes.
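One way to make the distinction concrete is to hash the compressed packets of a stream independently of the container that carries them. The sketch below assumes the packets have already been extracted by some demuxer (that step is not shown, and `stream_digest` is an illustrative name). A rewrapped file should give the same digest as its source, while a re-encoded file will not.

```python
import hashlib
from typing import Iterable


def stream_digest(packets: Iterable[bytes]) -> str:
    """SHA-256 over the compressed packets of one track, in order.

    Each packet is prefixed with its length so that packet boundaries,
    not just the concatenated bytes, contribute to the digest.
    """
    digest = hashlib.sha256()
    for packet in packets:
        digest.update(len(packet).to_bytes(4, "big"))
        digest.update(packet)
    return digest.hexdigest()


# Hypothetical packet payloads standing in for demuxed video packets.
original = [b"\x65\x88\x84\x00", b"\x41\x9a\x24", b"\x41\x9a\x48"]
rewrapped = list(original)                      # same bytes, new container
reencoded = [b"\x65\x88\x80\x10", b"\x41\x9a\x20", b"\x41\x9a\x44"]

print(stream_digest(original) == stream_digest(rewrapped))   # True
print(stream_digest(original) == stream_digest(reencoded))   # False
```

A mismatch is not always proof of re-encoding: some containers store the same H.264 data with different framing (length prefixes versus start codes), so the comparison should be made on streams normalised to the same packet format.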

Detecting re-encoding uses several signals. A sudden change in the bitrate profile, I-frame density, or quantisation parameter (QP) curve mid-file is a candidate marker. The presence of encoder-specific metadata from a different tool (e.g., an x264 encoding tag in a file claimed to be from a specific camera model) is another. Comparing the DCT coefficient statistics across the file can reveal regions with distinct quantisation histories.
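A simple way to screen for a sudden change in the bitrate profile is to compare each GOP's size with the file's own typical value. The sketch below flags GOPs whose byte count deviates from the median by more than a chosen multiple of the median absolute deviation. The function name, the threshold of 5, and the sample figures are all illustrative assumptions. Flagged GOPs are candidates for closer inspection, not findings.

```python
from statistics import median
from typing import List, Sequence


def flag_outlier_gops(gop_bytes: Sequence[int],
                      threshold: float = 5.0) -> List[int]:
    """Return indices of GOPs whose size is far from the file's median."""
    med = median(gop_bytes)
    deviations = [abs(b - med) for b in gop_bytes]
    mad = median(deviations)
    # Fall back to 1% of the median if the sizes are almost identical,
    # so a perfectly constant file does not divide by zero.
    scale = mad if mad > 0 else max(med * 0.01, 1.0)
    return [i for i, d in enumerate(deviations) if d / scale > threshold]


# Hypothetical per-GOP sizes in bytes for a near-constant-bitrate recording.
sizes = [100_000, 101_500, 99_200, 100_800, 4_000, 180_000, 100_300, 99_900]
print(flag_outlier_gops(sizes))  # [4, 5]
```

The median-based scale is used instead of the mean and standard deviation because a few extreme GOPs would otherwise inflate the spread and hide themselves.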

How containers and codecs leave encoder fingerprints

Encoders have degrees of freedom beyond what the codec standard requires: GOP size, reference frame count, quantisation parameter curves, rate control mode, in-loop filter strength, and the specific entropy coding tables chosen. Different manufacturers and software applications make different choices, and those choices leave patterns in the bitstream. Forensic analysis of the video bitstream can infer the encoder in much the same way that EXIF metadata identifies a camera model.

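GOP structure: consumer cameras typically produce closed GOPs with a fixed period (every 15, 30, or 60 frames). Software encoders such as x264 by default place keyframes adaptively, using scene-cut detection with a long maximum interval, so their GOP lengths vary with content. The period itself, and whether it is fixed or adaptive, can help identify the recording device or the software that last encoded the file.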
Bitrate management: constant bitrate (CBR) encoding is common in broadcast and CCTV; variable bitrate (VBR) is typical of modern phone cameras. The shape of the per-GOP bitrate curve is a device-specific signal.
Metadata atoms/boxes: MP4 and MOV files often contain vendor-specific boxes (udta, uuid) set by the recording software. These carry creation time, software version, and sometimes GPS data. Their presence or absence, and their byte layout, are device fingerprints.
Audio synchronisation offsets: different recorders introduce systematic audio-video offsets at known values. An offset inconsistent with the claimed device is a potential indicator of audio or video substitution.

Worked example

A CCTV clip with a missing eight-second window

GOP irregularity reveals where a deletion occurred in a transport system recording.

An MP4 file extracted from a bus company's DVR system is submitted as evidence in an assault case. The file is 47 minutes long. Defence counsel notes that a specific eight-second window they expected to see based on witness accounts appears to contain only black frames. The question is whether the clip was tampered with or whether the camera simply lost signal.

Container audit. The MP4 box hierarchy is parsed. The moov atom is positioned at the end of the file (consistent with a single-pass DVR recording, not a streaming-optimised export). The ftyp brand is 'MSNV', a brand associated with Sony's PSP-compatible MP4 profile; this is noted for comparison against reference recordings from the same DVR model. The creation-time atom matches the date of the incident. Nothing so far contradicts the claimed origin.
GOP map. I-frame positions across the full 47-minute file are extracted. The DVR records at 25 fps with a closed GOP of 50 frames (one I-frame every 2 seconds). This is consistent throughout the clip except for one segment at 23m 14s: three consecutive GOPs are 12, 88, and 14 frames instead of 50. Together these cover 114 frames (about 4.6 seconds at 25 fps) and fall within the eight-second window in question.
Bitrate profile. The per-GOP bitrate across the file is nearly constant (consistent with CCTV CBR encoding) except at the anomalous segment, where bitrate drops to near zero (the black frames are almost all-zero DCT coefficients, hence tiny) then jumps before returning to baseline. This is not consistent with a camera signal-loss event, which would produce I-frames at the recovery point rather than a continued P-frame chain.
Comparison with reference recordings. Five recordings from the same DVR model are obtained. All show consistent 50-frame GOP periods across their full duration, including during genuine short signal-loss events (which introduce new I-frames at recovery). The questioned file's mid-GOP anomaly is inconsistent with the device's normal behaviour.
Conclusion. The examiner reports that the GOP structure anomaly at 23m 14s is inconsistent with normal recording from this device model and is consistent with deletion and replacement of approximately 8 seconds of footage, noting that genuine signal loss would have produced different structural markers. The underlying original content cannot be recovered from the submitted file.

Structural analysis of the container and bitstream identified the anomaly here where visual inspection could not. The black frames were consistent with a technical fault to a non-specialist; the GOP structure was not consistent with how this device behaves during genuine signal loss.
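The GOP map step in this example can be sketched as follows. The input is a string of picture types, one character per frame in display order. How that string is obtained is outside the sketch (ffprobe's `-show_entries frame=pict_type` output is one possible source), and the function names and the choice to skip the final GOP are illustrative assumptions.

```python
from typing import List, Tuple


def gop_map(pict_types: str) -> List[Tuple[int, int]]:
    """Return (start_frame, length) for each GOP, splitting at I-frames."""
    starts = [i for i, t in enumerate(pict_types) if t == "I"]
    if not starts:
        return []
    bounds = starts + [len(pict_types)]
    return [(s, e - s) for s, e in zip(bounds, bounds[1:])]


def irregular_gops(gops: List[Tuple[int, int]], expected: int,
                   fps: float) -> List[Tuple[float, int]]:
    """Return (start_seconds, length) for GOPs that differ from `expected`.

    The final GOP is skipped because recordings often stop mid-GOP,
    which would otherwise be reported as an anomaly.
    """
    return [(start / fps, length)
            for start, length in gops[:-1]
            if length != expected]


# Synthetic stream mirroring the worked example: 50-frame GOPs with
# three irregular GOPs of 12, 88, and 14 frames in the middle.
normal = "I" + "P" * 49
stream = (normal * 2 + "I" + "P" * 11 + "I" + "P" * 87
          + "I" + "P" * 13 + normal * 2)

flagged = irregular_gops(gop_map(stream), expected=50, fps=25)
print([length for _, length in flagged])              # [12, 88, 14]
print(sum(length for _, length in flagged) / 25)      # 4.56
```

The expected GOP length should come from reference recordings made on the same device model, as in step four above, not from an assumption about what the device ought to do.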

Check your understanding

A video file has extension .mp4 but ffprobe reports the video codec as H.265 (HEVC). What does this tell you about the container and codec relationship?

Key Takeaways

Container format and codec are independent layers: the container manages track organisation and timing; the codec compresses the actual media data. The same codec can appear in multiple containers, and the same container can wrap multiple codecs.
Inter-frame compression (I, P, B frames and GOP structure) means most video frames cannot be decoded without reference to their neighbours, and irregularities in I-frame spacing are a primary indicator of editing or re-encoding.
Rewrapping transfers the bitstream unchanged between containers; re-encoding decodes and recompresses, adding generation loss and potentially masking the encoder fingerprint of the original recording.
Lossy audio codecs (MP3, AAC) leave permanent spectral artefacts including frequency ceilings and masking holes that persist even if the file is later converted to uncompressed PCM WAV.
Encoder fingerprints in the GOP period, bitrate curve, ftyp brand, and metadata boxes can help identify the recording device or software, and inconsistencies in these fingerprints are indicators of possible re-encoding or fabrication that warrant closer examination.

What is the difference between a container format and a codec?

A container (MP4, MKV, MOV, AVI) is a file format that organises multiple tracks together with timing information and metadata. A codec (H.264, HEVC, VP9, AAC) is the algorithm that compresses and decompresses the actual media data stored inside the container. The same codec can be wrapped in different containers, and the same container can hold data compressed by different codecs.

What are I-frames, P-frames, and B-frames in video compression?

In inter-frame compression, an I-frame is a self-contained image that does not reference any other frame. A P-frame stores only the differences from a previous I or P frame. A B-frame references both past and future frames. This GOP structure means a video cannot be decoded from an arbitrary starting point without seeking back to the nearest I-frame.

How does re-encoding a video differ from rewrapping its container?

Rewrapping moves the compressed bitstream from one container to another without decoding the video. The codec data is unchanged, so no additional generation loss occurs. Re-encoding decodes the compressed video and recompresses it, introducing quality loss and producing a new set of compression artefacts that can indicate the file has been processed while obscuring traces of the original encoder.

What forensic clues does a video's GOP structure leave?

The spacing of I-frames indicates encoder settings. Consumer cameras typically place I-frames at fixed intervals. Irregular I-frame spacing may indicate editing or re-encoding at that point. Bitrate fluctuations and quantisation parameter curves can also localise editing points within the timeline.

Why is uncompressed WAV audio preferred for forensic audio analysis over MP3?

MP3 uses perceptual coding to discard audio information the human ear is less sensitive to, introducing irreversible distortions. WAV stores the actual sample values without psychoacoustic filtering. Forensic tasks such as speaker identification and noise analysis rely on the full waveform. MP3 artefacts can also interfere with spectrographic comparison.
