How Deepfakes Are Generated: GANs, Diffusion and Face Swap Pipelines
Deepfakes are synthetic media produced by machine learning models that replace, animate, or entirely fabricate human faces and voices. This topic explains how generative adversarial networks, diffusion models, and dedicated face-swap tools such as FaceSwap and DeepFaceLab work, and how each architecture leaves characteristic artifacts that inform detection strategy.
A deepfake is a piece of media in which a person's face, voice, or both have been fabricated or replaced using machine learning. Three families of technology underpin nearly all deepfakes encountered in forensic casework: generative adversarial networks (GANs), diffusion models, and dedicated face-swap pipelines such as FaceSwap and DeepFaceLab. Each family works by a different mechanism, produces different visual output, and leaves a different set of statistical and structural artifacts. Forensic examiners who understand how the generation process works can select the right detection methods, interpret ambiguous findings correctly, and explain the technical evidence to courts without overstating certainty.
GANs dominated deepfake production from roughly 2017 to 2022. A GAN trains two neural networks in opposition: a generator that synthesises images and a discriminator that tries to distinguish synthetic from real ones. The competition drives the generator toward outputs that pass the discriminator's test, which means they also become harder for human observers to detect. GAN outputs carry a characteristic spectral signature from upsampling operations and tend to struggle with high-frequency detail such as hair, teeth, and background edges. Since 2022, diffusion models have become the dominant synthesis architecture for high-quality image generation. They work by a fundamentally different mechanism, reversing a learned noise-addition process, and leave a different artifact profile. Face-swap pipelines sit alongside both: they use autoencoders trained on specific individuals to transplant one person's identity onto another person's video while preserving head pose and expression.
Courts in multiple jurisdictions, including proceedings under the US Federal Rules of Evidence, the UK's Police and Criminal Evidence Act 1984, the EU's Digital Services Act framework for harmful synthetic media, and India's Bharatiya Sakshya Adhiniyam 2023, have encountered deepfake evidence or deepfake-based defences. In each context, the forensic examiner must not only say whether a piece of media is authentic but also explain which generation mechanism was used and why the detected artifacts are meaningful. That explanation requires a working knowledge of the generation process.
By the end of this topic you will be able to:
- Explain the generator-discriminator training loop in a GAN and identify the upsampling operations that produce checkerboard artifacts in the Fourier spectrum.
- Describe how diffusion models generate images through iterative denoising and explain why their artifact profile differs from GAN outputs.
- Trace the face-swap pipeline used by FaceSwap and DeepFaceLab from training data collection through autoencoder training to frame-level blending.
- Identify the characteristic artifacts left by each generation architecture and map each artifact type to the detection method that targets it.
- Describe how architecture choice affects the strategy and limitations of forensic deepfake detection in a casework or court context.
- Generative Adversarial Network (GAN): A machine learning architecture consisting of two networks trained simultaneously: a generator that produces synthetic images and a discriminator that classifies images as real or synthetic. Adversarial training drives both networks toward equilibrium, producing increasingly convincing output.
- Diffusion model: A generative model trained to reverse a noise-addition process. During inference, the model starts from random Gaussian noise and iteratively removes noise over many steps to produce a coherent image. Stable Diffusion and DALL-E 3 are widely deployed examples.
- Autoencoder: A neural network that compresses an input into a compact latent representation (encoder) then reconstructs it (decoder). Face-swap pipelines train autoencoders with a shared encoder but separate decoders for each identity, enabling identity transplantation.
- Latent space: The compressed internal representation learned by an encoder. Manipulating a point in latent space changes the corresponding output image in predictable ways, which is why latent-space editing can change expressions, lighting, or age while keeping identity consistent.
- Checkerboard artifact: A grid-like pattern visible in the Fourier power spectrum of GAN outputs, caused by transposed convolution or bilinear-upsampling operations used to increase image resolution in the generator. The pattern is often invisible to the eye but detectable by spectral analysis.
- Blending mask: In face-swap pipelines, a pixel-level mask that defines the face region to be composited onto the target frame. Imperfect masks leave boundary artifacts: color discontinuities, sharpness mismatches, or halo effects at the face perimeter.
Generative Adversarial Networks: architecture and artifact signature
Ian Goodfellow and colleagues introduced GANs in 2014. The core idea is straightforward: train two networks in competition. The generator takes a random noise vector and produces an image. The discriminator receives either a real image from the training set or a generator output, and must classify which it received. Each network updates its weights based on how well it performs its task. Over training iterations, the generator learns to produce images the discriminator cannot reliably classify as fake; the discriminator becomes a more demanding critic, pushing the generator further.
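The opposing objectives can be made concrete by writing out the loss each network minimises. The sketch below assumes the common binary cross-entropy formulation with the non-saturating generator loss. The function names and the discriminator scores are illustrative, and no network is actually trained.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score near 1, fakes near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating loss: the generator wants its fakes scored near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator scores (probability "real") for the same real
# batch and for generator outputs early and late in training.
real_scores = [0.90, 0.95]
early_fakes = [0.05, 0.10]   # fakes easily rejected
late_fakes = [0.45, 0.50]    # fakes nearly indistinguishable from real

g_early, g_late = generator_loss(early_fakes), generator_loss(late_fakes)
d_early = discriminator_loss(real_scores, early_fakes)
d_late = discriminator_loss(real_scores, late_fakes)
```

As the fakes improve, the generator's loss falls while the discriminator's loss rises. That is the pressure that forces the discriminator to become a more demanding critic.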
Face-specific GAN architectures include PGGAN (Progressive Growing of GANs, 2018), StyleGAN (2019), and StyleGAN2 (2020). These models generate high-resolution faces by starting from low-resolution synthesis and progressively adding detail layers. StyleGAN introduced a mapping network that converts noise into a style code controlling different levels of the image hierarchy: coarse structure at the low-resolution layers, fine texture at the high-resolution layers. The system allows fine-grained control over attributes such as age, gender, and hair, which made StyleGAN outputs common in synthetic identity fraud operations.
Beyond the spectral signature, GANs have known failure modes in specific image regions. Teeth often show unnatural regularity or merged boundaries between individual teeth. Hair at the periphery of the face tends to blend into the background with unrealistic softness. Earrings and jewelry are common failure points because their structure is inconsistent with adjacent background. The iris and pupil occasionally show bilateral asymmetry. These local anomalies are targets for region-specific forensic analysis and are the reason that full-face authentication pipelines apply different scrutiny to different facial zones.
Diffusion models: denoising as generation
Diffusion models take a different approach to generation. Training is built around a forward diffusion process: starting from a clean image, Gaussian noise is added progressively over a series of timesteps until the image is indistinguishable from pure noise. The model is trained to predict the noise that was added at each timestep, which is what allows it to reverse the process. During inference, the model starts from pure noise and applies the reverse process iteratively, removing the predicted noise step by step and producing a coherent image after many steps.
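The forward process has a convenient closed form, which the following minimal sketch illustrates on a four-value toy "image". It assumes a linear noise schedule of the kind used in the original DDPM formulation. The schedule endpoints, step count, and function names are assumptions chosen for illustration. The true noise stands in for the network's prediction, so the sketch shows the arithmetic of one denoising estimate rather than a trained model.

```python
import math
import random

def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (an assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1)
            for t in range(steps)]

def cumulative_alphas(betas):
    """alpha_bar[t]: cumulative product of (1 - beta) up to step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, t, alpha_bar, rng):
    """Jump to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [a * x + s * e for x, e in zip(x0, eps)], eps

def estimate_x0(xt, eps_pred, t, alpha_bar):
    """Invert the forward step given a noise prediction."""
    a, s = math.sqrt(alpha_bar[t]), math.sqrt(1.0 - alpha_bar[t])
    return [(x - s * e) / a for x, e in zip(xt, eps_pred)]

alpha_bar = cumulative_alphas(linear_betas(1000))
pixels = [0.2, -0.5, 0.9, 0.0]     # toy "image" scaled to [-1, 1]
noisy, eps = add_noise(pixels, 500, alpha_bar, random.Random(0))
recovered = estimate_x0(noisy, eps, 500, alpha_bar)  # true noise as a perfect prediction
```

With a perfect noise prediction the clean values are recovered exactly. A real model's prediction is imperfect, which is why generation proceeds in many small steps rather than one jump.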
Latent diffusion models, such as the Stable Diffusion architecture, apply the diffusion process in a compressed latent space rather than pixel space, reducing computational cost. A variational autoencoder encodes the input image into a latent representation; the diffusion model operates on this latent representation; a decoder then converts the denoised latent back to pixel space. Text conditioning is added through a cross-attention mechanism that allows a text prompt to guide the denoising direction. This is why systems like Stable Diffusion and Midjourney can generate faces matching a detailed text description.
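The cross-attention step can be sketched as scaled dot-product attention in which image latent tokens act as queries and text tokens supply keys and values. This is a simplified illustration: real models apply learned linear projections to produce the queries, keys, and values and use many attention heads, all of which are omitted here. The vectors are made-up toy values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query (image latent token) receives a weighted mix of the values
    (text tokens), weighted by query-key similarity."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        mixed = [sum(wj * v[c] for wj, v in zip(w, values))
                 for c in range(len(values[0]))]
        outputs.append(mixed)
        weights.append(w)
    return outputs, weights

# Two image tokens, two text tokens (toy values, no learned projections).
image_tokens = [[10.0, 0.0], [0.0, 10.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
attended, attn_weights = cross_attention(image_tokens, text_keys, text_values)
```

Each image token ends up dominated by the text token it matches most closely. This is the mechanism by which prompt content steers individual regions of the latent during denoising.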
Diffusion outputs do not carry the same spectral checkerboard signature as GAN outputs because the upsampling operations in the VAE decoder differ from those in GAN generators, and the iterative denoising process distributes frequency content differently. However, diffusion-generated images have their own forensic signatures. The denoising process tends to over-smooth fine texture in specific frequency bands. The VAE decoder introduces its own reconstruction artifacts that can be detected through analysis of local noise statistics. Some detection methods exploit the fact that the reverse diffusion process produces images with a characteristic noise residual structure that differs from the noise profile of camera-captured images.
| Property | GAN output | Diffusion model output |
|---|---|---|
| Primary artifact | Spectral checkerboard grid from upsampling | VAE decoder artifacts, over-smoothed texture bands |
| Generation speed | Fast (single forward pass) | Slow (many iterative denoising steps) |
| Resolution control | Fixed output resolution per model | More flexible; latent size can be varied at inference |
| Text-guided synthesis | Requires additional conditioning networks | Native via cross-attention |
| Detection difficulty | Established spectral methods available | Active research area; fewer proven tools |
| Common forensic tools | Fourier analysis, GAN fingerprint classifiers | Noise residual analysis, VAE artifact detection |
Face-swap pipelines: FaceSwap and DeepFaceLab
FaceSwap and DeepFaceLab are open-source pipelines built for identity transplantation rather than face generation. The goal is not to create a new face but to replace one person's face in a video with another person's face while preserving head pose, lighting, and expression. Both tools share the same fundamental architecture: a shared-encoder, split-decoder autoencoder.
The pipeline works in three stages. First, training data collection: thousands of face images are extracted from source video (person A, the identity to be transplanted) and target video (person B, the person whose footage is being manipulated). Facial landmarks are detected and the face region is cropped and aligned to a standard pose. Second, autoencoder training: a single encoder is trained on both face sets simultaneously. Two separate decoders are trained, one for person A and one for person B. The shared encoder learns a latent representation that captures expression and pose. Each decoder learns to reconstruct its person's identity from that shared representation. After training, feeding the encoder's output from a person B image through person A's decoder produces a reconstruction of person A's face with person B's expression and head angle. Third, synthesis and blending: for each target video frame, the face region is extracted, encoded, decoded through person A's decoder, then blended back into the original frame using a mask. The blending step adjusts color to match the surrounding skin tone and applies feathering at the mask edge.
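The third stage, blending, can be illustrated on a single row of grayscale pixels. The sketch below is a deliberately simplified stand-in for what the tools do: the colour adjustment is reduced to a mean shift and the feathering to a linear ramp. All function names and pixel values are hypothetical rather than taken from FaceSwap or DeepFaceLab.

```python
def feather_mask(width, start, end, feather):
    """Soft 1D mask: 1.0 on [start, end), ramping linearly toward 0.0
    over `feather` pixels on each side."""
    mask = []
    for i in range(width):
        if start <= i < end:
            mask.append(1.0)
        else:
            dist = start - i if i < start else i - (end - 1)
            mask.append(max(0.0, 1.0 - dist / (feather + 1)))
    return mask

def match_mean(swapped, target, mask):
    """Shift the swapped face so its mean inside the mask matches the target."""
    inside = [i for i, m in enumerate(mask) if m == 1.0]
    shift = (sum(target[i] for i in inside)
             - sum(swapped[i] for i in inside)) / len(inside)
    return [s + shift for s in swapped]

def blend(target, swapped, mask):
    """Per-pixel composite: mask * swapped + (1 - mask) * target."""
    return [m * s + (1.0 - m) * t for t, s, m in zip(target, swapped, mask)]

target_row = [100.0] * 12      # original frame (person B's footage)
decoded_row = [140.0] * 12     # decoder output, brighter than the frame
mask = feather_mask(12, 4, 8, feather=2)

unmatched = blend(target_row, decoded_row, mask)
matched = blend(target_row, match_mean(decoded_row, target_row, mask), mask)
```

Without the colour adjustment, the composite steps up in brightness across the feathered edge, which is exactly the kind of boundary discontinuity forensic analysis looks for. With it, the seam disappears in this idealised case. Real footage rarely allows a perfect match.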
DeepFaceLab offers several model variants with different capacity and artifact profiles. The SAEHD model (High Definition Styled AutoEncoder) is the most common; it adds attention mechanisms to preserve fine details. The Quick96 model sacrifices detail for speed. Each variant produces outputs with subtly different artifact characteristics, which means a detection model trained on one variant may not generalise to another. For casework, examiners should record which pipeline variant the suspected output resembles and be cautious about claiming detection certainty based on a single detection classifier.
Other generation methods: face reenactment and neural radiance fields
Face reenactment methods, such as First Order Motion Model and Face2Face, animate a static source image using motion signals from a target video. The source image provides identity; the target video provides movement. The generator learns to warp the source face to match the target person's expressions without retraining on the target person's data. This makes reenactment methods significantly easier to deploy than face-swap pipelines, because no training data collection from the target person is required. The output is a video in which the source person's face appears to mimic the target person's speech and expression.
Neural radiance fields (NeRF) and related 3D-aware synthesis methods represent a newer generation of tools. They model a person's face as a 3D implicit function, allowing synthesis of arbitrary viewpoints from a small number of training images. Systems such as NeRFace and Next3D extend this to dynamic expressions. The 3D-aware generation process produces more geometrically consistent outputs: lighting behaves correctly under head rotation in a way that 2D-based GANs often fail to replicate. Forensic detection of NeRF-based deepfakes is an active research area with few validated methods available for casework as of mid-2025.
Commercial platforms including Runway, Pika, and Sora generate video from text or image prompts using diffusion-based video models. These systems typically achieve temporal coherence by letting each frame's features attend to those of other frames through temporal attention layers, rather than generating frames independently. The artifact profile of video diffusion models includes inter-frame inconsistency in fine detail, unnatural motion of secondary elements such as background and hair, and occasional identity drift where facial features shift slightly across a long clip. Authentication of video diffusion outputs requires temporal analysis in addition to frame-level spatial analysis.
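One simple form of temporal analysis is to track identity drift across a clip. The sketch below assumes that a per-frame face embedding has already been obtained from some face-recognition model. The embeddings shown are made-up two-dimensional values, and the 0.9 threshold is arbitrary and would need calibration on known-authentic footage before any casework use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identity_drift(embeddings, threshold=0.9):
    """Return (frame_index, similarity) for frames whose similarity to the
    first frame falls below the threshold."""
    reference = embeddings[0]
    flagged = []
    for i, emb in enumerate(embeddings[1:], start=1):
        sim = cosine_similarity(reference, emb)
        if sim < threshold:
            flagged.append((i, sim))
    return flagged

# Hypothetical per-frame embeddings from an unspecified face-recognition model.
frame_embeddings = [[1.0, 0.0], [0.99, 0.10], [0.70, 0.70], [0.20, 0.98]]
drifting = identity_drift(frame_embeddings)
drifting_frames = [i for i, _ in drifting]
```

A flagged frame is a prompt for closer inspection, not a finding in itself: pose changes, occlusion, and compression can also lower embedding similarity in authentic video.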
How generation architecture determines detection strategy
Detection methods are not architecture-agnostic. A classifier trained on GAN outputs will not reliably detect diffusion outputs, and vice versa. This has direct consequences for casework. When a piece of disputed media arrives, the examiner's first task is to form a hypothesis about the generation architecture. Visual inspection of known GAN failure modes, spectral analysis for checkerboard signatures, and boundary analysis for face-swap blending artifacts can all contribute to architecture identification before formal detection is applied.
For GAN outputs, frequency-domain analysis is the most established approach. The 2D discrete Fourier transform of the image is computed, and the power spectrum is examined for periodic peaks corresponding to the generator's upsampling stride. Methods such as those developed by Dzanic et al. (2020) and Corvi et al. (2023) formalise this into classifiers that operate on the frequency representation. Noise residual extraction, subtracting a denoised version of the image from the original, isolates the camera-sensor noise pattern or its absence, which is a secondary detection signal used in PRNU-based authentication.
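The link between upsampling and a spectral peak can be reproduced in one dimension. The sketch below is a toy model under stated assumptions: a stride-2 transposed convolution is applied to a featureless input, once with a kernel whose overlap is uneven and once with a linear-interpolation kernel, and the power at the bin corresponding to a period-2 pattern is compared. Casework analysis uses the 2D transform of real images. The function names and kernels are illustrative only.

```python
import cmath

def transposed_conv1d(x, kernel, stride=2):
    """Circular transposed convolution: insert zeros between samples, then filter.
    Kernels here are symmetric, so correlation and convolution coincide."""
    n = len(x) * stride
    up = [0.0] * n
    for i, v in enumerate(x):
        up[i * stride] = v
    half = len(kernel) // 2
    return [sum(kernel[j] * up[(t + j - half) % n] for j in range(len(kernel)))
            for t in range(n)]

def power_spectrum(y):
    """Squared magnitude of the discrete Fourier transform, bin by bin."""
    n = len(y)
    return [abs(sum(y[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2 for k in range(n)]

flat = [1.0] * 8                                   # featureless low-resolution input
uneven = transposed_conv1d(flat, [1.0, 1.0, 1.0])  # kernel overlap differs by position
smooth = transposed_conv1d(flat, [0.5, 1.0, 0.5])  # linear-interpolation kernel
period2_bin = len(uneven) // 2                     # bin matching a period-2 pattern
peak_uneven = power_spectrum(uneven)[period2_bin]
peak_smooth = power_spectrum(smooth)[period2_bin]
```

The uneven kernel turns a flat input into an output alternating between 1 and 2, which appears as an isolated peak at the period-2 bin. The interpolation kernel leaves the output flat and that bin empty. In two dimensions the same effect produces the grid of peaks described above.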
For face-swap outputs, the most productive forensic approaches target inconsistency between the face region and the surrounding frame. Methods include: local noise level estimation inside and outside the face boundary, lighting direction estimation from skin highlights and shadows on both sides of the boundary, blur radius comparison, and frequency content comparison. The noise inconsistency and lighting analysis methods developed for copy-move and splicing detection apply directly to face-swap forensics because the underlying manipulation is a compositing operation.
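The first of these comparisons, local noise level inside versus outside the face boundary, can be sketched on a single synthetic pixel row. The estimator below uses adjacent-pixel differences and assumes flat underlying content, which real images do not have. Practical tools use wavelet-based or denoiser-residual estimators instead. The synthetic row, noise levels, and function names are all assumptions for illustration.

```python
import math
import random

def noise_sigma(segments):
    """Noise std from adjacent-pixel differences. Assumes flat content, where
    the variance of the difference of two noisy pixels is 2 * sigma^2."""
    diffs = [b - a for seg in segments for a, b in zip(seg, seg[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs) / 2.0)

def noise_ratio(row, face_start, face_end):
    """Ratio of background noise to face-region noise along one pixel row."""
    inside = noise_sigma([row[face_start:face_end]])
    outside = noise_sigma([row[:face_start], row[face_end:]])
    return outside / inside

def noisy_pixels(n, sigma, rng, level=120.0):
    """Synthetic flat region with Gaussian noise of the given std."""
    return [level + rng.gauss(0.0, sigma) for _ in range(n)]

rng = random.Random(0)
# Camera-like noise in the background, smoother noise in the pasted face.
row = (noisy_pixels(1000, 4.0, rng)
       + noisy_pixels(1000, 1.0, rng)
       + noisy_pixels(1000, 4.0, rng))
ratio = noise_ratio(row, 1000, 2000)
```

A ratio far from 1 indicates that the face region and its surroundings do not share a noise process, which is consistent with compositing. It is not proof on its own, because focus, exposure, and compression can also vary noise across an authentic frame.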
For diffusion outputs, validated methods are fewer but the research community has converged on several approaches. One class of method exploits the reconstruction property: inverting a diffusion-generated image to noise with a pretrained diffusion model and then re-denoising produces output very close to the original, because the image already lies close to the distribution the model has learned to generate; running the same process on a camera-captured image produces larger reconstruction errors. Another class analyses the local noise residual structure: camera captures contain spatially correlated noise from the sensor, while diffusion outputs contain a different residual pattern from the VAE decoder. A third approach uses foundation model embeddings: real and synthetic images cluster differently in the feature space of large vision models such as CLIP.
Evidence and admissibility considerations
Deepfake detection findings are increasingly presented in criminal and civil proceedings. The admissibility framework differs by jurisdiction. In the United States, Daubert v. Merrell Dow Pharmaceuticals (1993) requires federal courts to evaluate scientific evidence on the basis of testability, peer review, known error rate, and general acceptance in the scientific community. Deepfake detection methods based on published, peer-reviewed classifiers with documented false-positive rates satisfy Daubert's requirements more readily than proprietary black-box tools. In the United Kingdom, expert evidence is governed by the Criminal Procedure Rules Part 19, which require the expert to state the basis of the opinion, including the methodology and its limitations. The EU's Digital Services Act (Regulation (EU) 2022/2065) lists the labelling of manipulated or synthetic media among the risk-mitigation measures for very large online platforms, creating a parallel disclosure framework that may generate provenance metadata useful to courts.
Under India's Bharatiya Sakshya Adhiniyam 2023, electronic records are admissible with a certificate attesting their source and integrity under Section 63. Deepfake evidence raises additional questions: the certificate attests that the file was retrieved from a specific device or account, but not that the content is authentic. Forensic analysis of the generation architecture and artifact profile provides the substantive evidence of authenticity, complementing the procedural certificate. Indian courts have not yet developed a settled standard equivalent to Daubert for scientific evidence, but the BSA's provisions on expert opinion evidence under Section 39 apply.
The C2PA (Coalition for Content Provenance and Authenticity) standard, adopted by Adobe, Microsoft, and major camera manufacturers, attaches a cryptographically signed provenance manifest to media files at capture and at each editing step. A video carrying a valid C2PA manifest provides a verifiable chain of custody from camera sensor to presentation. A deepfake will typically either lack a C2PA manifest entirely or carry a manifest signed by the synthesis tool rather than by a camera. Where a file may carry a manifest, C2PA verification should be the first check, with artifact-based forensic analysis reserved for media that lacks a manifest or where the manifest cannot be verified. The chain of custody for digital media topic covers the provenance framework in detail.
Key Takeaways
- Three main architectures produce deepfakes: GANs, which use adversarial training to push a generator toward photorealistic output; diffusion models, which generate images by iteratively reversing a noise-addition process; and face-swap pipelines such as FaceSwap and DeepFaceLab, which transplant one person's identity onto another person's video using shared-encoder, split-decoder autoencoders.
- GAN generators leave a spectral checkerboard signature in the Fourier power spectrum due to upsampling operations, and often fail on high-frequency detail including teeth, hair boundaries, and earrings. These are the primary detection targets for GAN-generated media.
- Diffusion models produce higher-quality output without the GAN spectral signature, but carry their own artifact profile including VAE decoder reconstruction artifacts and characteristic noise residual structure. Detection methods for diffusion outputs rely on noise analysis and diffusion inversion tests rather than spectral grid analysis.
- Face-swap pipelines are best detected through boundary inconsistency analysis: comparing noise level, blur radius, and lighting direction inside versus outside the face region, because the compositing step creates measurable discontinuities at the face perimeter.
- Architecture identification should precede detection method selection in casework. Using a GAN classifier on a diffusion output, or a frame-level classifier on a temporally manipulated video, risks false negatives. C2PA provenance verification is the appropriate first check when the file may carry a signed manifest.