Deepfake Detection: Signal, Physiological, and Semantic Methods

A structured survey of how deepfake videos are detected, covering frequency-domain analysis, CNN residual detectors, physiological signals such as eye-blink rate, facial geometry irregularities, and the challenge of generalising to unseen generators.

Last updated: 10 Jun 2026

Detection is always downstream of generation. Every deepfake detection method is, at its core, a search for something the generation pipeline got wrong. The GAN left a checkerboard in the frequency domain. The face-swap tool produced a blending seam. The talking-head system drove lips slightly out of sync. The voice cloner left prosodic flatness where natural speech would have varied. The forensic analyst's job is to know which specific imperfections to look for, because not all generators make the same mistakes.

Deepfake detection methods fall into three broad families. Signal-based methods analyse the raw pixel or audio data for statistical anomalies introduced by the generation pipeline. Physiological methods check whether the body in the video behaves as a real human body would: does it blink, does blood flow produce the right subtle colour changes in the skin, does head pose stay consistent with the claimed recording conditions. Semantic methods look for inconsistencies in the content itself: teeth that a rendering engine could not produce correctly, ears that do not match the person's known anatomy, illumination gradients that contradict the scene geometry.

Each family has strengths and weaknesses, and none works reliably across all generator types. This topic maps those strengths and weaknesses concretely, covering the key published methods and their known failure modes. It also describes the arms-race dynamic candidly: every time a detection signal is published, it can become a target for the generation pipeline to suppress, which is why forensic practice requires a multi-signal approach rather than reliance on any single detector.

Key terms

Frequency-domain artefact: A periodic or statistical anomaly in the Fourier spectrum of an image or audio signal introduced by the generation pipeline's upsampling, filtering, or codec operations. The GAN checkerboard pattern is the canonical example.
Noiseprint: A CNN-based camera-model fingerprint extractor by Cozzolino and Verdoliva. Applied to deepfakes, it reveals inconsistency between the camera fingerprint in the genuine background and the absent or different fingerprint in the synthesised face region.
Physiological signal: A biological process visible in video, such as eye blinking, the blood-volume pulse measured by rPPG (remote photoplethysmography), or head micro-motion from the cardiac cycle, that deepfake generators typically fail to replicate accurately.
rPPG: Remote photoplethysmography. A technique for measuring heart rate from subtle periodic colour changes in facial skin caused by blood-volume pulses. Authentic video shows coherent rPPG signals; synthesised faces typically do not.
CNN residual detector: A convolutional neural network trained on the high-frequency residual image, the difference between the original and a de-noised version, to classify whether an image was produced by a camera or a generative model.
Detection generalisation: The capacity of a trained detector to correctly identify deepfakes produced by generators not seen during training. Low generalisation is the central limitation of current deepfake detection systems.

Spectral artefacts and frequency-domain detection

The generator's upsampling step leaves a signature visible in the Fourier spectrum.

When a GAN generator upsamples its internal feature maps to produce a full-resolution image, it typically uses transposed convolution (often loosely called deconvolution). This operation distributes values from a smaller grid onto a larger one by inserting zeros between elements and then applying a convolution. Because the kernel overlaps the inserted zeros unevenly (most visibly when the kernel size is not divisible by the stride), neighbouring output pixels receive different numbers of contributions, and the output has a periodic pattern of varying energy whose period is set by the upsampling stride. In an image this manifests as a faint regular grid, invisible to the eye but clearly present in a 2-D Fourier transform as a set of peaks at predictable locations.
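The mechanism can be reproduced in a few lines. The sketch below is a toy illustration, not any particular generator: it assumes a flat 16x16 feature map, a stride of 2, and a 3x3 kernel of ones standing in for learned weights, so that any structure in the output comes from the upsampling alone. The function names are hypothetical.

```python
import numpy as np

def upsample_zero_insert(x, stride):
    """Insert (stride - 1) zeros between samples along both axes."""
    h, w = x.shape
    out = np.zeros((h * stride, w * stride), dtype=float)
    out[::stride, ::stride] = x
    return out

def filter2d_same(x, kernel):
    """Naive zero-padded cross-correlation (what 'convolution' layers compute)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

feature_map = np.ones((16, 16))   # flat input: no image content at all
kernel = np.ones((3, 3))          # kernel size 3 is not divisible by stride 2
image = filter2d_same(upsample_zero_insert(feature_map, 2), kernel)

# Interior pixels take the values 1, 2 or 4 depending on how many
# non-zero samples the kernel overlaps: a period-2 checkerboard.
spectrum = np.abs(np.fft.fft2(image - image.mean()))
nyquist = image.shape[1] // 2     # period 2 pixels -> 0.5 cycles per pixel
print("peak at Nyquist:", spectrum[0, nyquist])
print("typical bin:    ", np.median(spectrum))
```

Even with a perfectly flat input, the output spectrum has a strong peak at 0.5 cycles per pixel, which is the kind of periodic signature a frequency-domain detector looks for.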

Figure: Frequency-domain GAN detection workflow.

Frank et al. (2020) demonstrated that a simple classifier trained only on the frequency spectrum of images (computed in their work with the discrete cosine transform) achieves near-perfect accuracy on multiple GAN architectures, significantly outperforming classifiers trained in the pixel domain for cross-architecture generalisation. The limitation is that diffusion models and flow-based generators use different upsampling operations and do not produce the same checkerboard pattern. Frequency-domain detectors trained on GAN artefacts therefore perform poorly on diffusion-model output, which reinforces the generalisation problem.
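To make "a classifier trained on the spectrum" concrete, the sketch below computes one possible compact spectral feature: the azimuthally averaged log-power spectrum. This is a swapped-in summary chosen for brevity, not the exact feature set of Frank et al., and the function name, bin count, and synthetic test images are all assumptions.

```python
import numpy as np

def radial_power_profile(gray, n_bins=32):
    """Azimuthally averaged log-power spectrum of a 2-D greyscale crop."""
    img = np.asarray(gray, dtype=float)
    spectrum = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    log_power = np.log1p(np.abs(spectrum) ** 2)
    h, w = img.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    # Assign every frequency bin to one of n_bins rings by distance from DC.
    ring = np.minimum((radius / (radius.max() + 1e-9) * n_bins).astype(int),
                      n_bins - 1)
    sums = np.bincount(ring.ravel(), weights=log_power.ravel(), minlength=n_bins)
    counts = np.bincount(ring.ravel(), minlength=n_bins)
    return sums / np.maximum(counts, 1)

# Synthetic stand-ins: white noise for a camera crop, and the same noise
# with a faint period-2 checkerboard added for a GAN-like crop.
rng = np.random.default_rng(0)
camera_like = rng.normal(size=(64, 64))
yy, xx = np.indices((64, 64))
checkerboard = 0.5 * (-1.0) ** (yy + xx)
gan_like = camera_like + checkerboard

profile_real = radial_power_profile(camera_like)
profile_fake = radial_power_profile(gan_like)
print("highest-frequency ring, real vs fake:", profile_real[-1], profile_fake[-1])
```

The two profiles differ only in the highest-frequency ring, where the checkerboard energy lands. A real detector would feed such profiles (or the full 2-D spectrum) into a trained classifier; a generator that leaves no such peak would produce a profile this feature cannot separate.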

Physiological inconsistency detection

Real faces blink, pulse, and move in ways that generators do not faithfully replicate.

Human bodies have involuntary, temporally structured behaviour that is hard to synthesise. Early GAN face generators were trained predominantly on open-eye portraits, which meant that generated videos had abnormally low eye-blink rates and unnatural blink dynamics. Li et al. (2018) exploited this, building a blink-detection module that classified videos based on whether the blink frequency and blink shape matched statistical priors for real subjects. This was one of the first published detectors and was effective against the generation methods available at that time.
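A blink-rate check can be sketched with the eye aspect ratio (EAR), a simple landmark heuristic. This is not the method of Li et al.; it is a lighter stand-in that assumes six eye landmarks per frame are already available from some landmark detector. The threshold of 0.2 and the two-frame minimum are illustrative assumptions that would need tuning per camera and subject.

```python
import numpy as np

def eye_aspect_ratio(eye):
    """EAR for six (x, y) landmarks ordered p1..p6 around the eye contour."""
    eye = np.asarray(eye, dtype=float)
    vertical = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    horizontal = np.linalg.norm(eye[0] - eye[3])
    return vertical / (2.0 * horizontal)

def count_blinks(ear_series, threshold=0.2, min_frames=2):
    """Count runs of at least min_frames consecutive frames below threshold."""
    blinks, run = 0, 0
    for ear in ear_series:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:          # a blink still in progress at the last frame
        blinks += 1
    return blinks

def blinks_per_minute(ear_series, fps, threshold=0.2, min_frames=2):
    minutes = len(ear_series) / fps / 60.0
    return count_blinks(ear_series, threshold, min_frames) / minutes

# Hypothetical landmarks for a wide-open eye, and a hypothetical one-second
# EAR trace at 30 fps with one 3-frame blink and one 1-frame glitch.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
ear_trace = [0.3] * 10 + [0.1] * 3 + [0.3] * 10 + [0.1] * 1 + [0.3] * 6
print(eye_aspect_ratio(open_eye), count_blinks(ear_trace))
```

The resulting rate would then be compared against a reference range for real subjects; a rate far below that range is a flag for further examination, not a verdict.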

Eye-blink rate: early GANs under-produced blinking; subsequent generators added explicit blink modelling, narrowing this gap but not eliminating it.
rPPG signal: authentic faces show a periodic subtle colour oscillation in the skin, at roughly the heart rate, detectable by averaging pixel values in cheek regions over time. Synthesised faces typically show no such coherent oscillation or show it only spuriously.
Head micro-motion: the cardiac cycle produces small, periodic head motion visible in video. Face-swap pipelines applied to real video will carry genuine head motion, but fully synthesised talking heads often have motion vector statistics inconsistent with this physiological source.
Gaze and pupil response: pupil dilation responses to scene lighting and gaze direction consistency with scene geometry are difficult to synthesise and provide useful secondary signals in high-resolution video.
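The rPPG check in the list above can be sketched as a spectral test on a per-frame colour trace. The sketch assumes the mean green-channel value of a cheek region has already been extracted for each frame; the 0.7 to 4.0 Hz band (roughly 42 to 240 beats per minute) and the function name are assumptions, and real pipelines add detrending, motion compensation, and more robust colour projections.

```python
import math
import numpy as np

def dominant_pulse_hz(green_means, fps, band=(0.7, 4.0)):
    """Return (peak frequency in Hz, share of in-band power at that peak)."""
    x = np.asarray(green_means, dtype=float)
    x = (x - x.mean()) * np.hanning(len(x))      # remove DC, reduce leakage
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    if not in_band.any() or power[in_band].sum() == 0:
        return float("nan"), 0.0
    k = int(np.argmax(np.where(in_band, power, 0.0)))
    return float(freqs[k]), float(power[k] / power[in_band].sum())

# Synthetic 10-second trace at 30 fps with a 1.2 Hz (72 bpm) pulse plus noise.
fps = 30
t = np.arange(300) / fps
rng = np.random.default_rng(1)
trace = 0.5 * np.sin(2 * np.pi * 1.2 * t) + rng.normal(scale=0.05, size=t.size)

pulse_hz, concentration = dominant_pulse_hz(trace, fps)
print(f"peak {pulse_hz:.2f} Hz, in-band power share {concentration:.2f}")
```

A coherent pulse concentrates in-band power around one frequency; a trace from a synthesised face would be expected to show a much flatter in-band spectrum. Where to set the decision threshold on that share is an empirical question this sketch does not answer.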

Facial geometry and landmark irregularities

Some body parts are harder to synthesise correctly than others.

Facial landmark detection provides a fast, interpretable way to flag geometric irregularities. A set of 68 or 478 landmarks covering the eyes, nose, mouth, and jaw outline can be fitted to any face image. For authentic faces, the inter-landmark distances and angles fall within tight statistical bounds established from large datasets of real subjects. Synthesised faces, particularly those produced by older or lower-quality generation pipelines, show landmark configurations that deviate from these bounds, especially at the outer corners of the eyes, the ear region, and the nose-to-lip distance.

Beyond individual frames, temporal landmark trajectories in video carry information. A real talking head has facial muscle activations that follow biomechanical constraints: the orbicularis oculi contracts when smiling, the jaw moves in a predictable arc during speech. Synthesised videos often violate these soft constraints, producing landmark trajectories that are statistically improbable for genuine face motion.
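One simple temporal landmark statistic is how much the inter-ocular distance varies across frames. The sketch below assumes landmarks are supplied as an array of shape (frames, points, 2) and uses indices 36-41 and 42-47 for the two eyes, as in the common 68-point scheme with zero-based indexing; treat the indices, the function name, and the choice of peak-to-peak range as the statistic as assumptions. It does not compensate for head pose or scale, which a real analysis must do first.

```python
import numpy as np

LEFT_EYE = list(range(36, 42))    # assumed 68-point indexing
RIGHT_EYE = list(range(42, 48))

def interocular_range(landmarks, left_idx=LEFT_EYE, right_idx=RIGHT_EYE):
    """Peak-to-peak variation (pixels) of eye-centre distance across frames."""
    pts = np.asarray(landmarks, dtype=float)
    left = pts[:, left_idx, :].mean(axis=1)      # eye centre per frame
    right = pts[:, right_idx, :].mean(axis=1)
    dist = np.linalg.norm(left - right, axis=1)
    return float(dist.max() - dist.min())

# Hypothetical three-frame clip: the right eye drifts horizontally.
frames = np.zeros((3, 68, 2))
frames[:, 36:42] = (100.0, 50.0)
for i, shift in enumerate((0.0, 2.0, -1.0)):
    frames[i, 42:48] = (160.0 + shift, 50.0)

print(interocular_range(frames))
```

The number is only meaningful relative to a baseline, such as the same statistic computed on authentic footage of the same person at comparable resolution and head pose.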

Region	Detection signal	Reliability
Eyes	Blink rate, blink shape, pupil response	High for early GANs; reduced after generator adaptation
Teeth	Rendering artefacts: smearing, wrong number, incorrect occlusion	Moderate; improves at high resolution
Ears	Shape distortion, missing or duplicated tragus	Useful for NeRF-based synthesis; not for face-swap
Hair boundary	Soft transition artefacts at edge of face/background mask	Moderate; depends on background complexity
Skin texture	Unnaturally smooth or periodic high-frequency noise	High when comparing face to background region
Jaw and chin	Asymmetry, landmark drift across frames	Useful in video; weak on single images

CNN residual detectors and Noiseprint

Training a network to see the generation pipeline's invisible fingerprint.

The SRM (Steganalysis Rich Model) approach, originally developed for image steganography detection, was applied to deepfake detection by several research groups from 2019 onward. The key idea is to compute a high-pass residual image, the original minus a smoothed version, which removes low-frequency content and amplifies the subtle statistical texture left by processing operations. A CNN trained on residual images can learn to distinguish camera-native noise from generation-pipeline noise.
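The residual computation itself is short. The sketch below uses a plain 3x3 box filter as the smoother, which is an assumption made for simplicity; SRM uses a bank of specific high-pass kernels, and learned denoisers are also common. The function name is hypothetical.

```python
import numpy as np

def highpass_residual(gray, size=3):
    """Image minus its box-filtered (mean-smoothed) version."""
    img = np.asarray(gray, dtype=float)
    pad = size // 2
    padded = np.pad(img, pad, mode="reflect")
    smooth = np.zeros_like(img)
    for dy in range(size):
        for dx in range(size):
            smooth += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    smooth /= size * size
    return img - smooth

# Smooth content vanishes from the residual; fine-scale texture survives.
ramp = np.tile(np.arange(8.0), (8, 1))           # slow horizontal gradient
impulse = np.zeros((9, 9))
impulse[4, 4] = 9.0                              # single-pixel detail

print(np.abs(highpass_residual(ramp)[1:-1, 1:-1]).max())
print(highpass_residual(impulse)[4, 4])
```

The gradient leaves no interior residual while the single-pixel detail is almost fully preserved, which is why a CNN fed with residuals sees processing texture rather than scene content.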

Cozzolino and Verdoliva's Noiseprint (2020) extends this idea by framing it as camera-model attribution. A CNN is trained to extract a fingerprint that identifies the camera model used to take a photograph, analogous to PRNU (photo-response non-uniformity) in traditional digital camera forensics. When applied to a deepfake image, the face region shows an absent or anomalous fingerprint, because it was synthesised rather than captured. The background shows a consistent camera fingerprint if it came from genuine source video. The boundary between these two regions is the detection signal.

Figure: Noiseprint detection. The background shows a camera fingerprint; the face region does not.

Semantic inconsistency analysis

Some things generators consistently get wrong, and a trained eye can find them.

Semantic inconsistencies are failures of plausibility at the content level rather than at the signal level. They do not require a trained classifier or a Fourier transform. They require knowledge of what faces, bodies, and scenes look like. Teeth are a canonical example: generating a realistic arrangement of individual teeth, with correct occlusion, correct spacing, and correct response to lighting, is a consistently hard problem for generative models. High-resolution deepfakes often show smeared or incorrectly shaped teeth, especially in wide-open-mouth frames.

Illumination geometry: the direction of specular highlights on the skin and the direction of cast shadows should be geometrically consistent with each other and with the background scene. Inconsistent lighting direction is a reliable manual review signal.
Reflection in eyes: the corneal specular reflection (the small bright spot on the eye surface) should match the room's light sources and their positions. Mismatched or absent corneal reflections indicate the face was synthesised or composited from a different lighting environment.
Jewellery and accessories: rings, earrings, and glasses frames are structurally regular objects. Generators often warp or partially dissolve them, particularly near the skin-to-object boundary.
Temporal semantic consistency: an earring that changes shape between frames, a necklace that flickers, or a scar that appears and disappears are temporal semantic anomalies invisible in any single frame.
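The illumination check in the list above reduces to comparing two direction estimates. The sketch below assumes the light directions for two regions have already been estimated by some other means (for example from highlights and cast shadows) and are given as 3-D vectors; the function names and the 20-degree tolerance are illustrative assumptions, not established forensic thresholds.

```python
import numpy as np

def angle_between_deg(u, v):
    """Angle in degrees between two direction vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def lighting_consistent(dir_a, dir_b, tolerance_deg=20.0):
    """True if the two estimated light directions agree within tolerance."""
    return angle_between_deg(dir_a, dir_b) <= tolerance_deg

# Hypothetical estimates: frontal light on the face, oblique light on the body.
face_light = (0.0, 0.0, 1.0)
body_light = (1.0, 0.0, 1.0)
print(angle_between_deg(face_light, body_light),
      lighting_consistent(face_light, body_light))
```

Because the direction estimates themselves carry uncertainty, the tolerance should reflect the estimator's error, and a disagreement is best treated as a prompt for manual review.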

Detection generalisation and the arms race

The best detector today may be obsolete when the next generator ships.

The detection generalisation problem is well documented in benchmark studies. Rössler et al. (2019) introduced FaceForensics++, a dataset with videos produced by four manipulation methods. Detectors trained on one manipulation method dropped substantially in accuracy when tested on a different one. Cross-generator evaluations on FaceForensics++ and DFDC (DeepFake Detection Challenge) show the same pattern: accuracy on unseen generator families is substantially lower than accuracy on held-out examples of the same generator.

Proposed approaches to improve generalisation include frequency-domain training, which learns generation-family-agnostic spectral features rather than generator-specific pixel patterns; self-supervised contrastive pre-training on real-world image augmentations; and ensemble detectors that aggregate signals from multiple feature families. None of these fully resolves the problem, particularly against adversarially post-processed deepfakes where JPEG compression or film-grain overlays deliberately suppress spectral artefacts.

In practice, forensic conclusions based on deepfake detection should state the specific detector or method applied, the generator families it was validated against, the test conditions, and the confidence bounds on the result. Presenting a detector output without these qualifications overstates what the technology can actually deliver.
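One way to keep those qualifications attached to a result is to record them alongside it. The sketch below is a hypothetical record structure, not a standard reporting format; it uses the Wilson score interval to put bounds on a detector's validation accuracy, and every name and number in the example is made up for illustration. Note that this interval describes accuracy on the validation set, not the probability that a particular video is fake.

```python
import math
from dataclasses import dataclass

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

@dataclass
class DetectorFinding:
    method: str
    validated_generators: list
    test_conditions: str
    correct: int          # correct decisions on the validation set
    total: int            # validation set size

    def summary(self):
        low, high = wilson_interval(self.correct, self.total)
        return (f"{self.method}: {self.correct}/{self.total} correct in validation "
                f"(95% CI {low:.3f}-{high:.3f}); validated against "
                f"{', '.join(self.validated_generators)}; {self.test_conditions}")

# Hypothetical finding with made-up validation numbers.
finding = DetectorFinding(
    method="Spectral peak detector",
    validated_generators=["GAN family A", "GAN family B"],
    test_conditions="720p H.264 re-encoded face crops",
    correct=90,
    total=100,
)
print(finding.summary())
```

The Wilson interval is used here because it behaves sensibly for small validation sets and accuracies near 0 or 1, where the simpler normal approximation can give bounds outside the valid range.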

Worked example

Analysing a suspected deepfake in a harassment case

Applying a multi-signal approach when no single detector is conclusive.

A complainant submits a video as evidence of harassment, alleging it is a deepfake. The video is 45 seconds long, shows the complainant's face on a body performing acts they deny, and was posted to a social media platform. The platform's original file was retrieved via legal process at 720p H.264. The analyst applies a structured multi-signal examination.

Metadata. No camera EXIF. File creation timestamp is two days before posting. No GPS. This is consistent with re-encoded synthetic output but not diagnostic on its own, as many genuine videos are also re-encoded.
Frequency domain. A 2-D FFT of face-region crops at three time points shows spectral peaks at multiples of 1/8 cycles per pixel (an 8-pixel period), consistent with a GAN with stride-8 upsampling. The body region shows no such peaks.
Noiseprint map. The face region shows near-zero camera fingerprint energy. The body region shows a consistent camera fingerprint. The facial boundary is visible as a sharp fingerprint discontinuity.
Landmark geometry. A 68-point landmark fit finds inter-ocular distance fluctuates by 3.4 pixels across frames without corresponding head-pose change. Authentic video of the complainant provided by their solicitor shows fluctuation of under 0.8 pixels at comparable resolution.
Illumination analysis. Specular highlights on the face are directionally consistent with frontal lighting. The body's shadows indicate oblique side lighting. The two lighting environments are inconsistent with a single recording setup.
Conclusion. Four independent signals converge: GAN spectral artefact in the face region, Noiseprint fingerprint discontinuity at the face boundary, landmark geometry drift, and illumination inconsistency. The conclusion presented to court is that the face in the video shows strong evidence of synthetic substitution consistent with a GAN-based face-swap pipeline, and that the face and body regions carry distinct provenance signatures.

The case illustrates the value of signal convergence. No single finding was sufficient on its own for court presentation. Four converging signals from different feature families, each with a different failure mode, together constitute a defensible forensic opinion.

Check your understanding


Why does a frequency-domain GAN detector trained on one GAN architecture perform poorly on a different one?

Key Takeaways

Deepfake detection falls into three families: signal-based (spectral, residual, compression), physiological (blink, rPPG, micro-motion), and semantic (lighting, geometry, accessory rendering).
GAN outputs show checkerboard spectral peaks from transposed-convolution upsampling; Frank et al. (2020) demonstrated this as a cross-architecture discriminative feature.
Physiological detectors exploit the failure of generators to replicate involuntary biological signals; they are time-limited because generators adapt once a signal is published.
Noiseprint exploits the camera-model fingerprint, which is present in genuine video backgrounds but absent or inconsistent in synthesised face regions of face-swap deepfakes.
Detection generalisation across unseen generator families is the central unresolved challenge; adversarial post-processing further degrades detector performance in real deployments.
Forensic practice requires multi-signal convergence: an opinion that combines findings from multiple independent methods, each reported with its validation scope and confidence bounds.

What is the physiological method for deepfake detection based on eye blinking?

Li et al. (2018) observed that GAN-generated face videos did not replicate the natural eye-blinking rate and pattern of real subjects, because GAN training data contained far fewer closed-eye images. A detector trained on blink frequency and blink duration outperformed pixel-level detectors on that generation family. However, once the paper was published, subsequent generators were fine-tuned with closed-eye training data, illustrating the arms-race dynamic.

What is Noiseprint and how does it detect deepfakes?

Noiseprint, introduced by Cozzolino and Verdoliva (2020), is a CNN trained to extract a camera-model fingerprint from images, analogous to PRNU in traditional camera forensics. For deepfake detection, a Noiseprint map of an authentic frame shows consistent camera-model noise; a deepfake frame shows an inconsistent or absent model fingerprint in the synthesised face region, making the substitution visible.

Why do deepfake detectors struggle to generalise across different generators?

Most detectors learn the specific artefact signature of the generators they were trained on. When a novel architecture is released with different upsampling methods or adversarial post-processing, the trained detector's features no longer match and accuracy drops sharply. Cross-generator generalisation is an active research problem addressed by frequency-domain and self-supervised approaches.

Can semantic inconsistencies like incorrect ear rendering reliably detect deepfakes?

Semantic cues such as ear structure, hair boundary, and teeth rendering are useful secondary signals but are difficult to turn into reliable automated detectors because they require high image resolution and consistent viewpoint. They are more useful as manual-review checkpoints for an analyst examining a challenged image. Their reliability varies with the generation method: face-swap pipelines preserve the original ear structure, while NeRF-based systems often distort it.



