A technical walkthrough of how deepfakes are made, covering GAN-based face swapping, diffusion model synthesis, neural talking heads, and voice cloning, and why understanding these generation artefacts is the first step in detecting them.
In 2017 a Reddit user called "deepfakes" began posting face-swapped videos created with open-source machine learning tools. Within two years the techniques had spread to consumer-grade laptops, professional-quality toolkits, and eventually browser extensions. Today the term deepfake covers at least four distinct families of technology: GAN-based face swapping, encoder-decoder face replacement, diffusion model image synthesis, and neural voice cloning. They share one property relevant to forensic work: every method leaves traces of how it was made.
Understanding how deepfakes are generated is not a side note for forensic scientists. It is the prerequisite for detection. A detector that does not know what kind of artefact it is looking for will perform unreliably, particularly against generators it has never seen. The checkerboard patterns that betray one GAN architecture do not appear in a diffusion model's output. The blending seam typical of a face-swap pipeline does not exist in a fully synthetic image. Each generation method leaves its own signature, and a forensic analyst needs to know which signatures to look for.
This topic walks through the main generation pipelines in enough technical depth to ground detection work. It covers the adversarial training dynamic that defines GANs, the encoder-decoder architecture behind tools like DeepFaceLab, the neural-rendering and NeRF-based talking-head systems used in political disinformation, diffusion model synthesis and its latent-space properties, and voice cloning via text-to-speech and voice conversion. Each section ends with the forensic relevance of the artefacts that method introduces.
Two networks, competing until neither can win cleanly.
Ian Goodfellow and colleagues introduced the Generative Adversarial Network framework in 2014. The essential insight is that you can teach a network to generate realistic images by pairing it with a second network whose only job is to tell real from fake. The generator tries to fool the discriminator; the discriminator tries to catch the generator. Each one drives the other to improve. At convergence, the generator has learned to sample from a distribution that closely resembles the training data.
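The adversarial dynamic described above is usually written as a two-player minimax game. The formulation below follows the value function from the original 2014 paper; the symbols are the conventional ones rather than anything defined on this page: $G$ is the generator, $D$ the discriminator, $x$ a real training sample, and $z$ a random latent vector fed to the generator.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}(z)} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```

The discriminator pushes $V$ up by scoring real samples near 1 and generated samples near 0; the generator pushes it down by making $D(G(z))$ approach 1.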
For face generation specifically, ProGAN introduced progressive growing of image resolution during training, and StyleGAN and its successor StyleGAN2 (Karras et al., 2020) added style-based control of facial features. These models can produce 1024x1024 portraits that are almost indistinguishable from photographs to an untrained observer. Forensically, they leave characteristic artefacts in the high-frequency spectrum: a checkerboard pattern introduced by transposed convolutions in the upsampling layers, and pixel-level noise statistics that differ from the noise structure of a real camera sensor.
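The link between transposed convolutions and a frequency-domain peak can be seen in a one-dimensional toy case. The sketch below is illustrative only: `transposed_conv1d` and `nyquist_ratio` are hypothetical helper names, the kernels are fixed at all-ones rather than learned, and a real detector would work on the 2-D spectrum of an image. It assumes the common explanation that a kernel size not divisible by the stride makes kernel copies overlap unevenly.

```python
import cmath

def transposed_conv1d(signal, kernel, stride):
    """Naive 1-D transposed convolution: every input sample adds a scaled
    copy of the kernel to the output, `stride` positions apart."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out

def nyquist_ratio(samples):
    """Magnitude of the highest-frequency DFT bin relative to the DC bin.
    Expects an even number of samples."""
    n = len(samples)

    def bin_magnitude(k):
        return abs(sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                       for t, s in enumerate(samples)))

    return bin_magnitude(n // 2) / bin_magnitude(0)

flat = [1.0] * 16  # a perfectly smooth toy "feature map"

# Kernel size 3 with stride 2: copies overlap on every second output sample.
# The slice drops the edge samples and keeps an even-length interior.
uneven = transposed_conv1d(flat, [1.0, 1.0, 1.0], stride=2)[2:30]

# Kernel size 2 with stride 2: copies tile the output with no overlap.
even = transposed_conv1d(flat, [1.0, 1.0], stride=2)[2:30]

print(uneven[:6])
print(round(nyquist_ratio(uneven), 3), round(nyquist_ratio(even), 3))
```

In this toy setup the unevenly overlapping kernel turns a flat input into an alternating 2, 1, 2, 1 pattern, which shows up as energy in the highest-frequency bin (about one third of the DC magnitude), while the evenly tiling kernel leaves that bin empty. The 2-D analogue of the alternating pattern is the checkerboard.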
The original deepfake: one person's face, another person's movements.
The original "deepfakes" method, later formalised in tools such as DeepFaceLab and FaceSwap, uses a shared encoder with two identity-specific decoders. Both decoders are trained to reconstruct their respective target's face from a shared compressed representation of facial geometry and expression. At inference time you substitute: take a video of person A, encode each frame, decode it with B's decoder, and blend the result back into the original. Person A's head movements and expressions now appear on person B's face.
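The shared-encoder, two-decoder layout can be sketched structurally as follows. This is a toy under stated assumptions, not the DeepFaceLab or FaceSwap implementation: the class `FaceSwapAutoencoder`, the helper `make_linear`, and the `blend` function are hypothetical names, faces are flat lists of pixel values, the layers are single untrained random linear maps, and the training loop is omitted entirely.

```python
import random

def make_linear(n_in, n_out, rng):
    """Return a function applying a random (untrained) linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return lambda x: [sum(w * v for w, v in zip(row, x)) for row in weights]

class FaceSwapAutoencoder:
    def __init__(self, pixels=64, latent=8, seed=0):
        rng = random.Random(seed)
        # One encoder shared by both identities.
        self.encoder = make_linear(pixels, latent, rng)
        # One decoder per identity.
        self.decoders = {
            "A": make_linear(latent, pixels, rng),
            "B": make_linear(latent, pixels, rng),
        }

    def reconstruct(self, face, identity):
        """Training-time path: encode a face, decode with its own decoder."""
        return self.decoders[identity](self.encoder(face))

    def swap(self, face, target="B"):
        """Inference-time path: encode one person's face, decode with the
        other person's decoder."""
        return self.decoders[target](self.encoder(face))

def blend(frame, face, mask):
    """Alpha-blend the generated face into the original frame.
    mask values: 1.0 = fully generated, 0.0 = original pixel kept."""
    return [m * f + (1.0 - m) * p for p, f, m in zip(frame, face, mask)]

model = FaceSwapAutoencoder(pixels=16, latent=4, seed=1)
frame_of_a = [0.5] * 16
swapped = model.swap(frame_of_a, target="B")
composited = blend(frame_of_a, swapped, [1.0] * 8 + [0.0] * 8)
```

The only difference between `reconstruct` and `swap` is which decoder receives the latent code, which is why the pose and expression carried in that code transfer across identities. The `blend` step is where the face-boundary seam discussed in the forensic literature originates.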
Forensically, face-swap videos carry a split provenance. The background, lighting, camera motion blur, and compression artefacts all derive from the original source video. Only the face region is synthesised. This creates detectable inconsistencies: the face region may have different noise characteristics, different JPEG block patterns, or shading that is inconsistent with the illumination of the surrounding scene. These are the signals that spatial and compression-domain detectors exploit.
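One way to make the noise-characteristics check concrete is to compare the variance of a high-pass residual inside and outside the face region. The sketch below is a minimal illustration on synthetic data, not a validated detector: the 3x3 mean filter, the function names, the square face mask, and the noise levels are all assumptions chosen to keep the example small.

```python
import random

def highpass_residual(img):
    """Each interior pixel minus the mean of its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    res = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            local_mean = sum(img[y + dy][x + dx]
                             for dy in (-1, 0, 1)
                             for dx in (-1, 0, 1)) / 9.0
            res[y][x] = img[y][x] - local_mean
    return res

def residual_variance(res, mask, want_face):
    """Variance of the residual over interior pixels in one region."""
    h, w = len(res), len(res[0])
    vals = [res[y][x]
            for y in range(1, h - 1)
            for x in range(1, w - 1)
            if mask[y][x] == want_face]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

# Synthetic frame: the background carries camera-like noise (sigma 4),
# the pasted face region is smoother (sigma 1). Both values are invented.
rng = random.Random(0)
size = 32
mask = [[8 <= y < 24 and 8 <= x < 24 for x in range(size)]
        for y in range(size)]
frame = [[128.0 + rng.gauss(0.0, 1.0 if mask[y][x] else 4.0)
          for x in range(size)]
         for y in range(size)]

res = highpass_residual(frame)
ratio = residual_variance(res, mask, False) / residual_variance(res, mask, True)
print(f"background/face residual variance ratio: {ratio:.1f}")
```

A ratio far from 1 only indicates that the two regions have different noise statistics. On real footage, differences in texture, focus, and compression can produce the same effect, so a result like this is a prompt for closer examination rather than a conclusion.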
Animating a still photograph into a speaking, moving person.
Face-swap pipelines need video of the source identity. Neural talking-head systems do not. Given a single photograph or a short video clip, these models build a neural representation of the subject's head (a volumetric radiance field in NeRF-based systems such as AD-NeRF, learned 3D motion coefficients in systems such as SadTalker (Zhang et al., 2023)) and can then animate it with arbitrary head pose and lip movements driven by an audio track. The output shows a real person appearing to say words they never said.
The forensic challenge with talking-head systems is that they often produce fewer blending seams than classic face swaps, because the entire face is rendered from the volumetric model rather than transplanted from a source video. Artefacts instead appear in areas of fine detail: teeth, inner mouth structures, and the boundary between the head and the background. The temporal consistency of natural head motion is also difficult to fully replicate, and subtle jitter or motion-vector irregularities can be detected in compressed video bitstreams.
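A simple proxy for the jitter mentioned above is the mean absolute second difference of a tracked quantity, such as a head-yaw angle or a landmark coordinate per frame. The sketch below assumes such a track has already been extracted by some upstream tool; `jitter_score` is a hypothetical name, both trajectories are synthetic, and this is not the bitstream motion-vector analysis itself.

```python
import math
import random

def jitter_score(track):
    """Mean absolute second difference, i.e. frame-to-frame acceleration.
    Smooth natural motion gives small values; frame-level jitter inflates it."""
    accel = [track[i + 1] - 2.0 * track[i] + track[i - 1]
             for i in range(1, len(track) - 1)]
    return sum(abs(a) for a in accel) / len(accel)

rng = random.Random(0)

# A slow, smooth head turn: yaw in degrees over 200 frames (synthetic).
smooth = [10.0 * math.sin(2.0 * math.pi * t / 100.0) for t in range(200)]

# The same motion with small per-frame perturbations added (synthetic).
jittery = [angle + rng.gauss(0.0, 0.3) for angle in smooth]

print(round(jitter_score(smooth), 3), round(jitter_score(jittery), 3))
```

Because the score depends on frame rate, tracking noise, and the amount of genuine motion, it is only meaningful when compared against reference footage captured under similar conditions.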
Noise reversed into a photograph that never happened.
Diffusion models, which came to prominence in the 2020-2022 wave of research, work on a different principle from GANs. During training the model learns to undo the effect of adding Gaussian noise to an image, step by step. At generation time it starts from pure noise and repeatedly applies the learned denoising function, guided by a text or image prompt, until a coherent image emerges. Stable Diffusion (Rombach et al., 2022) runs this process in a compressed latent space rather than pixel space, which makes high-resolution generation feasible on consumer hardware.
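The noising process that the model learns to undo can be sketched in a few lines. The example below uses the closed-form forward step and linear noise schedule from the DDPM formulation (1,000 steps, beta from 1e-4 to 0.02) as an assumption; it is not Stable Diffusion's actual configuration, the function names are hypothetical, and the "image" is a four-value list. In place of a trained network, it feeds the true noise back in to show what a perfect noise prediction would recover.

```python
import math
import random

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1)
            for t in range(steps)]

def cumulative_alpha(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(x0, alpha_bar_t, noise):
    """Forward step in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * n for x, n in zip(x0, noise)]

def estimate_x0(xt, alpha_bar_t, predicted_noise):
    """Invert the forward step given a noise estimate. A trained network
    would supply predicted_noise; it is not modelled here."""
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - s * n) / a for x, n in zip(xt, predicted_noise)]

rng = random.Random(0)
alpha_bar = cumulative_alpha(linear_beta_schedule(1000))

x0 = [0.2, -0.5, 0.9, 0.0]                     # toy "image"
noise = [rng.gauss(0.0, 1.0) for _ in x0]
xt = add_noise(x0, alpha_bar[499], noise)      # halfway through the schedule
recovered = estimate_x0(xt, alpha_bar[499], noise)

print(alpha_bar[0], alpha_bar[-1])
```

By the final step `alpha_bar` is close to zero, so almost none of the original signal remains and generation can start from pure Gaussian noise. Any mismatch between the network's noise estimate and the true noise is what ends up as the denoising residual in the output image.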
| Property | GAN output | Diffusion model output |
|---|---|---|
| Generation speed | Fast: single forward pass | Slower: many denoising steps |
| Spectral artefacts | Checkerboard from transposed convolution | High-frequency periodicity from U-Net upsampling |
| Blending seams | Present at face boundary in face-swap use | Absent in fully synthetic output |
| Out-of-distribution handling | Fails at uncommon poses/lighting | Better generalisation via text conditioning |
| Forensic fingerprint source | Generator weights as a device fingerprint | Diffusion schedule and U-Net architecture |
| Detection maturity | Well-studied, many published detectors | Fewer published detectors as of 2024 |
From a forensic standpoint, diffusion model outputs lack camera-native properties. A photograph carries sensor noise (photon shot noise, read noise), optical vignetting, chromatic aberration, and EXIF metadata tied to a specific device. A diffusion-generated image has none of these. Its noise is the residual of the denoising process, its spectral energy distribution is statistically distinct, and it carries no authentic EXIF history. Detection approaches based on these properties can achieve high accuracy on known generators, but generalisation to new model families remains an open problem.
From text to a convincing imitation of any enrolled speaker.
Audio deepfakes follow two parallel tracks. Text-to-speech (TTS) systems like SV2TTS (Jia et al., 2018) and VALL-E (Wang et al., 2023) synthesise speech in a target speaker's voice directly from text, given a short reference recording. Voice conversion systems transform the timbre of one speaker's recorded speech to match another's, preserving linguistic content. Both can produce convincing results from as little as three to ten seconds of reference audio, and both have been used in real-world fraud. The 2019 UK energy-firm CEO impersonation case is the first widely documented instance: attackers used a cloned voice to authorise a £200,000 transfer.
Forensic audio analysis of suspected voice clones draws on the same toolkit as traditional speaker identification, but the discriminative features shift. The question is no longer whether two voices come from the same biological speaker; it is whether the challenged recording bears the statistical marks of a synthesis pipeline. Anti-spoofing research, catalogued in the ASVspoof benchmark series run since 2015, provides standardised test conditions and published equal-error-rate benchmarks for detection systems.
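The equal error rate quoted in such benchmarks is the operating point at which the false-acceptance and false-rejection rates coincide. The sketch below approximates it by sweeping every observed score as a threshold. It assumes higher scores mean "more likely bona fide", uses made-up scores, and is a simplified stand-in rather than the official ASVspoof evaluation code.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by trying each observed score as the threshold.
    Convention assumed here: higher score = more likely bona fide."""
    best_gap, eer = None, None
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: genuine speech scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Invented detector scores for five genuine and five cloned recordings.
bonafide = [0.9, 0.8, 0.7, 0.6, 0.4]
spoof = [0.5, 0.3, 0.2, 0.1, 0.05]

print(equal_error_rate(bonafide, spoof))
```

With these toy scores, a threshold of 0.5 rejects one genuine recording and accepts one clone, so both error rates are 20% and the function returns 0.2. On real evaluation sets the two rates rarely cross exactly at an observed score, which is why published tools interpolate between neighbouring thresholds.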
You cannot detect what you do not know was made.
Detection methods are downstream of generation methods. A system trained on GAN artefacts will not reliably flag diffusion model output, because the artefacts are architecturally specific. This is the detection-generalisation problem, and it is arguably the central challenge in deepfake forensics: new generators are released regularly, often with architectural changes that invalidate earlier detectors, and ground-truth-labelled training data for the newest systems lags by months or years.
A forensic analyst who understands the generation pipeline can reason about which artefacts to expect even without a trained detector for the specific system. If a face-swap pipeline was used, look for blending inconsistencies at the face boundary and compression-domain discrepancies between face and background regions. If a diffusion model was used, look for the absence of camera-native noise and the presence of U-Net spectral periodicity. This generation-informed reasoning is a more reliable first-pass approach than any single binary classifier.