A structured survey of how deepfake videos are detected, from spectral frequency-domain analysis and CNN residual detectors to physiological signals like eye-blink rate, facial geometry irregularities, and the generalisation challenge across unseen generators.
Detection is always downstream of generation. Every deepfake detection method is, at its core, a search for something the generation pipeline got wrong. The GAN left a checkerboard in the frequency domain. The face-swap tool produced a blending seam. The talking-head system drove lips slightly out of sync. The voice cloner left prosodic flatness where natural speech would have varied. The forensic analyst's job is to know which specific imperfections to look for, because not all generators make the same mistakes.
Deepfake detection methods fall into three broad families. Signal-based methods analyse the raw pixel or audio data for statistical anomalies introduced by the generation pipeline. Physiological methods check whether the body in the video behaves as a real human body would: whether it blinks, whether blood flow produces the right subtle colour changes in the skin, and whether head pose stays consistent with the claimed recording conditions. Semantic methods look for inconsistencies in the content itself: teeth that a rendering engine could not produce correctly, ears that do not match the person's known anatomy, illumination gradients that contradict the scene geometry.
Each family has strengths and weaknesses, and none works reliably across all generator types. This topic maps those strengths and weaknesses concretely, covering the key published methods and their known failure modes. It also sets up the arms-race dynamic honestly: every time a detection signal is published, it can become a target for the generation pipeline to suppress, which is why forensic practice requires a multi-signal approach rather than reliance on any single detector.
The generator's upsampling step leaves a signature visible in the Fourier spectrum.
When a GAN generator upsamples its internal feature maps to produce a full-resolution image, it typically uses transposed convolution (also called deconvolution). This operation distributes values from a smaller grid onto a larger one by inserting zeros between elements and then applying a convolution. Because the filter kernels cover these inserted zeros unevenly, the output has a periodic pattern of varying energy at a spatial frequency determined by the upsampling stride. In an image this manifests as a faint regular grid, invisible to the eye but clearly present in a 2-D Fourier transform as a set of peaks at predictable locations.
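The uneven-overlap mechanism described above can be reproduced in one dimension with nothing but the standard library. The following is a minimal sketch, not a model of any particular generator: the function names, the all-ones kernels, and the 16-sample flat input are illustrative assumptions. It compares a kernel whose size is not divisible by the stride (uneven overlap) with one whose size is (even overlap).

```python
import cmath


def transposed_conv_1d(signal, kernel, stride):
    """1-D transposed convolution: each input sample stamps a scaled kernel
    copy into the output, `stride` positions apart (equivalent to zero
    insertion followed by convolution)."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


def dft_magnitudes(samples):
    """Magnitude of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n)
    ]


flat_input = [1.0] * 16  # a perfectly uniform feature map (illustrative)

# Kernel size 3 is not divisible by stride 2, so kernel copies overlap unevenly.
# The slice [2:32] drops the edge samples and keeps an even-length interior.
uneven = transposed_conv_1d(flat_input, [1.0, 1.0, 1.0], stride=2)[2:32]
# Kernel size 4 is divisible by stride 2, so every output gets two contributions.
even = transposed_conv_1d(flat_input, [1.0, 1.0, 1.0, 1.0], stride=2)[2:32]

uneven_mags = dft_magnitudes(uneven)
even_mags = dft_magnitudes(even)

# Strongest non-DC frequency bin of the uneven output.
peak_bin = max(range(1, len(uneven_mags)), key=lambda k: uneven_mags[k])
print(uneven[:6], "peak at bin", peak_bin, "of", len(uneven_mags))
print("even-overlap magnitude at the same bin:", round(even_mags[peak_bin], 6))
```

With a flat input, the uneven case alternates between 2 and 1 across the output, which puts all of the non-DC energy into a single bin at the highest representable frequency. The even case stays flat and shows no such peak. A trained network's kernels are not all-ones, so the real pattern is subtler, but the periodicity comes from the same overlap arithmetic.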
Frank et al. (2020) demonstrated that a simple classifier trained only on the frequency spectrum of images (they used the discrete cosine transform rather than the Fourier transform) achieves near-perfect accuracy on multiple GAN architectures, significantly outperforming classifiers trained in the pixel domain for cross-architecture generalisation. The limitation is that diffusion models and flow-based generators are built differently and do not necessarily produce the same checkerboard pattern. Frequency-domain detectors trained on GAN artefacts therefore perform poorly on diffusion model output, which is one concrete instance of the generalisation problem.
Real faces blink, pulse, and move in ways that generators do not faithfully replicate.
Human bodies have involuntary, temporally structured behaviour that is hard to synthesise. Early GAN face generators were trained predominantly on open-eye portraits, which meant that generated videos had abnormally low eye-blink rates and unnatural blink dynamics. Li et al. (2018) exploited this, building a blink-detection module that classified videos based on whether the blink frequency and blink shape matched statistical priors for real subjects. This was one of the first published detectors and was effective against the generation methods available at that time.
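To make the blink-rate idea concrete, here is a minimal sketch that counts blinks in a per-frame eye-openness signal and converts the count to blinks per minute. It is not the method of Li et al. (2018); it is a simple threshold rule chosen for illustration. The openness signal, the threshold of 0.2, and the minimum run length of 2 frames are all assumptions, and a real pipeline would first need an eye-landmark or eye-state estimator to produce the signal.

```python
def count_blinks(eye_openness, closed_threshold=0.2, min_closed_frames=2):
    """Count runs of consecutive frames in which the eye is 'closed'.

    eye_openness: one value per frame, larger meaning more open
    (hypothetical units; any monotonic openness measure would do).
    """
    blinks = 0
    run = 0  # length of the current run of closed frames
    for value in eye_openness:
        if value < closed_threshold:
            run += 1
        else:
            if run >= min_closed_frames:
                blinks += 1
            run = 0
    if run >= min_closed_frames:  # clip ended while the eye was closed
        blinks += 1
    return blinks


def blinks_per_minute(eye_openness, fps, **kwargs):
    """Blink rate normalised by clip duration."""
    minutes = len(eye_openness) / fps / 60.0
    if minutes <= 0:
        return 0.0
    return count_blinks(eye_openness, **kwargs) / minutes


# Toy signal: 10 seconds at 30 fps, eyes open (0.35) except for two blinks.
signal = [0.35] * 300
for frame in list(range(50, 54)) + list(range(200, 205)):
    signal[frame] = 0.05

blink_count = count_blinks(signal)
blink_rate = blinks_per_minute(signal, fps=30)
print(blink_count, "blinks,", round(blink_rate, 1), "per minute")
```

The resulting rate would then be compared against a reference range for real subjects. That range is deliberately left out here, because it depends on the population, the task, and the recording conditions.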
Some body parts are harder to synthesise correctly than others.
Facial landmark detection provides a fast, interpretable way to flag geometric irregularities. A set of 68 or 478 landmarks covering the eyes, nose, mouth, and jaw outline can be fitted to any face image. For authentic faces, the inter-landmark distances and angles fall within tight statistical bounds established from large datasets of real subjects. Synthesised faces, particularly those produced by older or lower-quality generation pipelines, show landmark configurations that deviate from these bounds, especially at the outer corners of the eyes, the ear region, and the nose-to-lip distance.
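The statistical-bounds check described above can be sketched as a z-score test on scale-normalised landmark distances. Everything specific in this example is an assumption made for illustration: the landmark names, the two chosen ratios, the reference means and standard deviations, and the limit of 3 standard deviations. Real reference statistics would have to be estimated from a large set of authentic faces.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def geometry_features(landmarks):
    """Distances divided by the outer-eye span, so image scale cancels out."""
    eye_span = distance(landmarks["left_eye_outer"], landmarks["right_eye_outer"])
    return {
        "nose_to_lip": distance(landmarks["nose_tip"], landmarks["upper_lip"]) / eye_span,
        "nose_to_chin": distance(landmarks["nose_tip"], landmarks["chin"]) / eye_span,
    }


def flag_outliers(features, reference, z_limit=3.0):
    """Return {feature: z-score} for features outside the reference bounds."""
    flagged = {}
    for name, value in features.items():
        mean, std = reference[name]
        z = (value - mean) / std
        if abs(z) > z_limit:
            flagged[name] = z
    return flagged


# Placeholder (mean, std) values, NOT measured from any dataset.
REFERENCE = {"nose_to_lip": (0.20, 0.03), "nose_to_chin": (0.70, 0.05)}

plausible_face = {
    "left_eye_outer": (0, 0), "right_eye_outer": (100, 0),
    "nose_tip": (50, 40), "upper_lip": (50, 60), "chin": (50, 110),
}
# Same face with the upper lip displaced downward.
distorted_face = dict(plausible_face, upper_lip=(50, 75))

plausible_flags = flag_outliers(geometry_features(plausible_face), REFERENCE)
distorted_flags = flag_outliers(geometry_features(distorted_face), REFERENCE)
print(plausible_flags, distorted_flags)
```

Because each flagged feature carries its own z-score, the output stays interpretable: an examiner can see which distance is out of range and by how much, rather than receiving a single opaque score.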
Beyond individual frames, temporal landmark trajectories in video carry information. A real talking head has facial muscle activations that follow biomechanical constraints: the orbicularis oculi contracts when smiling, the jaw moves in a predictable arc during speech. Synthesised videos often violate these soft constraints, producing landmark trajectories that are statistically improbable for genuine face motion.
| Region | Detection signal | Reliability |
|---|---|---|
| Eyes | Blink rate, blink shape, pupil response | High for early GANs; reduced after generator adaptation |
| Teeth | Rendering artefacts: smearing, wrong number, incorrect occlusion | Moderate; improves at high resolution |
| Ears | Shape distortion, missing or duplicated tragus | Useful for NeRF-based synthesis; not for face-swap |
| Hair boundary | Soft transition artefacts at edge of face/background mask | Moderate; depends on background complexity |
| Skin texture | Unnaturally smooth or periodic high-frequency noise | High when comparing face to background region |
| Jaw and chin | Asymmetry, landmark drift across frames | Useful in video; weak on single images |
Training a network to see the generation pipeline's invisible fingerprint.
The SRM (Steganalysis Rich Model) approach, originally developed for image steganography detection, was applied to deepfake detection by several research groups from 2019 onward. The key idea is to compute a high-pass residual image, the original minus a smoothed version, which removes low-frequency content and amplifies the subtle statistical texture left by processing operations. A CNN trained on residual images can learn to distinguish camera-native noise from generation-pipeline noise.
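The "original minus a smoothed version" residual can be sketched in a few lines. This is a minimal illustration under stated assumptions: a 3x3 box blur stands in for the smoothing step, the image is a small list of lists of grey values, and the function names are hypothetical. SRM itself uses a bank of fixed high-pass filters rather than a single blur, so this shows the principle only.

```python
def box_blur(image, radius=1):
    """Mean filter over a (2*radius+1)^2 window, clipped at the borders."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += image[yy][xx]
                        count += 1
            blurred[y][x] = total / count
    return blurred


def high_pass_residual(image, radius=1):
    """Residual = original minus its smoothed version."""
    blurred = box_blur(image, radius)
    return [
        [image[y][x] - blurred[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]


# Smooth content: a linear brightness ramp (illustrative stand-in for a scene).
ramp = [[10.0 * x + 5.0 * y for x in range(6)] for y in range(6)]
# The same ramp with a faint +/-1 alternating texture added on top.
textured = [
    [ramp[y][x] + (1.0 if (x + y) % 2 == 0 else -1.0) for x in range(6)]
    for y in range(6)
]

ramp_residual = high_pass_residual(ramp)
textured_residual = high_pass_residual(textured)
print(round(ramp_residual[2][2], 6), round(textured_residual[2][2], 6))
```

Away from the borders, the ramp's residual is zero while the faint texture survives almost intact, even though it is tiny compared with the ramp itself. This is why a residual image is a better input than raw pixels for a network that needs to learn noise statistics rather than scene content.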
Cozzolino and Verdoliva's Noiseprint (2020) extends this idea by framing it as camera-model attribution. A CNN is trained to extract a fingerprint that identifies the camera model used to take a photograph, analogous to PRNU (photo-response non-uniformity) in traditional digital camera forensics. When applied to a deepfake image, the face region shows an absent or anomalous fingerprint, because it was synthesised rather than captured. The background shows a consistent camera fingerprint if it came from genuine source video. The boundary between these two regions is the detection signal.
Some things generators consistently get wrong, and a trained eye can find them.
Semantic inconsistencies are failures of plausibility at the content level rather than at the signal level. They do not require a trained classifier or a Fourier transform. They require knowledge of what faces, bodies, and scenes look like. Teeth are a canonical example: generating a realistic arrangement of individual teeth, with correct occlusion, correct spacing, and correct response to lighting, is a consistently hard problem for generative models. High-resolution deepfakes often show smeared or incorrectly shaped teeth, especially in wide-open-mouth frames.
The best detector today may be obsolete when the next generator ships.
The generalisation problem is well documented in benchmark studies. Rössler et al. (2019) introduced FaceForensics++, a dataset of videos produced by four manipulation methods. Detectors trained on one manipulation method lost substantial accuracy when tested on a different one. Cross-generator evaluation on FaceForensics++ and DFDC (DeepFake Detection Challenge) shows the same pattern: accuracy on unseen generator families is substantially lower than accuracy on held-out examples from a generator seen during training.
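A cross-generator evaluation is usually summarised as a train-by-test accuracy matrix, with the diagonal holding in-domain results and the off-diagonal cells holding results on unseen generators. The sketch below computes the gap between the two. The generator names and every accuracy value are invented placeholders chosen to show the calculation. They are not results from FaceForensics++, DFDC, or any other benchmark.

```python
def generalisation_gap(accuracy):
    """Summarise a cross-generator accuracy matrix.

    accuracy: dict mapping (train_generator, test_generator) to accuracy
    in [0, 1]. Returns (mean in-domain, mean cross-generator, gap).
    """
    in_domain = [a for (train, test), a in accuracy.items() if train == test]
    cross = [a for (train, test), a in accuracy.items() if train != test]
    if not in_domain or not cross:
        raise ValueError("need both in-domain and cross-generator entries")
    mean_in = sum(in_domain) / len(in_domain)
    mean_cross = sum(cross) / len(cross)
    return mean_in, mean_cross, mean_in - mean_cross


# Placeholder values for two hypothetical generator families, A and B.
placeholder_matrix = {
    ("A", "A"): 0.98, ("A", "B"): 0.65,
    ("B", "A"): 0.70, ("B", "B"): 0.97,
}

mean_in, mean_cross, gap = generalisation_gap(placeholder_matrix)
print(f"in-domain {mean_in:.3f}, cross-generator {mean_cross:.3f}, gap {gap:.3f}")
```

Reporting the off-diagonal mean alongside the diagonal keeps a strong in-domain number from being mistaken for evidence of performance on unseen generators.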
Proposed approaches to improve generalisation include frequency-domain training, which learns generation-family-agnostic spectral features rather than generator-specific pixel patterns; self-supervised contrastive pre-training on real-world image augmentations; and ensemble detectors that aggregate signals from multiple feature families. None of these fully resolves the problem, particularly against adversarially post-processed deepfakes where JPEG compression or film-grain overlays deliberately suppress spectral artefacts.
In practice, forensic conclusions based on deepfake detection should state the specific detector or method applied, the generator families it was validated against, the test conditions, and the confidence bounds on the result. Presenting a detector output without these qualifications overstates what the technology can actually deliver.
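One way to attach confidence bounds to a validation result is a Wilson score interval on the detector's measured accuracy. The sketch below is a minimal example under stated assumptions: the detector name, generator families, test conditions, and the counts of 90 correct out of 100 are hypothetical, and the interval describes uncertainty in the validation accuracy, not the probability that any individual video is fake.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def describe_result(detector, validated_on, conditions, correct, total):
    """One-line statement covering method, validation scope, and bounds."""
    low, high = wilson_interval(correct, total)
    return (
        f"{detector}: validated on {', '.join(validated_on)} under {conditions}; "
        f"accuracy {correct / total:.1%} "
        f"(95% Wilson interval {low:.1%} to {high:.1%}, n={total})"
    )


# All names and counts below are hypothetical.
low, high = wilson_interval(90, 100)
report = describe_result(
    "hypothetical-spectral-detector",
    ["generator family A", "generator family B"],
    "H.264 video at 720p",
    correct=90,
    total=100,
)
print(report)
```

The Wilson interval is used here because it behaves sensibly at small sample sizes and near 0% or 100% accuracy, where the simpler normal approximation can produce bounds outside the valid range.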
Why does a frequency-domain GAN detector trained on one GAN architecture perform poorly on a different one?