Deepfake Generation: GANs, Diffusion, and Face-Swap Pipelines

A technical walkthrough of how deepfakes are made, covering GAN-based face swapping, diffusion model synthesis, neural talking heads, and voice cloning, and why understanding these generation artefacts is the first step in detecting them.

Last updated: 19 Jun 2026

Deepfakes are synthetic media produced by one of four main generation families: GAN-based face swapping, encoder-decoder face replacement, diffusion model image synthesis, and neural voice cloning. Each method leaves a distinct set of artefacts tied to its architecture, from checkerboard frequency patterns in GAN upsampling layers to the absence of camera-native noise in diffusion outputs. Understanding these generation-specific signatures is the prerequisite for detection, because detectors trained on one family's artefacts perform unreliably against another. Forensic analysis therefore begins with identifying which pipeline produced the media before applying any detection tool.

In 2017 a Reddit user operating under the handle "deepfakes" published face-swapped videos built with open-source machine learning tools. Within two years the techniques were accessible on consumer laptops and packaged into professional toolkits. The term deepfake now covers four distinct generation families: GAN-based face swapping, encoder-decoder face replacement, diffusion model image synthesis, and neural voice cloning. Each leaves traces of how it was made.

Understanding how deepfakes are generated is not a side note for forensic scientists. It is the prerequisite for detection. A detector that does not know what kind of artefact it is looking for will perform unreliably, particularly against generators it has never seen. The checkerboard patterns that betray one GAN architecture do not appear in a diffusion model's output. The blending seam typical of a face-swap pipeline does not exist in a fully synthetic image. Each generation method leaves its own signature, and a forensic analyst needs to know which signatures to look for.

This topic walks through the main generation pipelines in enough technical depth to ground detection work. It covers the adversarial training dynamic that defines GANs, the encoder-decoder architecture behind tools like DeepFaceLab, the neural-rendering and NeRF-based talking-head systems used in political disinformation, diffusion model synthesis and its latent-space properties, and voice cloning via text-to-speech and voice conversion. Each section ends with the forensic relevance of the artefacts that method introduces.

By the end of this topic you will be able to:

Describe the adversarial training dynamic of GANs and explain why transposed-convolution upsampling produces checkerboard artefacts visible in the Fourier domain.
Distinguish face-swap encoder-decoder pipelines from full-synthesis methods and identify the forensic signals each leaves at the face boundary and in compression domains.
Explain how NeRF-based talking-head systems animate a static photograph and specify which regions produce characteristic artefacts.
Compare diffusion model outputs with camera-captured photographs in terms of sensor noise, EXIF provenance, and spectral properties.
List the main forensic indicators of voice-cloned audio, including MFCC distribution differences, prosodic flatness, and double-compression codec signatures.

Key terms

GAN: Generative Adversarial Network. A framework with two neural networks, a generator that creates synthetic data and a discriminator that tries to distinguish it from real data. They train together in an adversarial loop until the generator's output is difficult to separate from genuine content.
Encoder-decoder: A neural architecture where an encoder compresses an input into a compact latent representation and a decoder reconstructs an output image from it. Face-swap tools train separate decoders for each identity on a shared encoder, allowing one face's expression to drive another's appearance.
Diffusion model: A generative model trained to reverse a step-by-step noise-addition process. Starting from random noise, a trained diffusion model progressively denoises toward a coherent image matching a text or image prompt. Stable Diffusion and DALL-E 3 are well-known implementations.
NeRF (Neural Radiance Field): A neural representation that encodes a 3-D scene as a continuous volumetric function, allowing novel viewpoints to be rendered. In talking-head systems, a NeRF-based model can generate new head poses and lip movements from a single portrait photograph.
Voice conversion: Transforming the timbre and identity of one speaker's voice to match another while preserving the linguistic content. Used in voice-cloning attacks to impersonate a target using only a short reference recording.
Latent space: The compressed, lower-dimensional representation of data learned by a neural network's internal layers. Generative models sample from or navigate this space to produce new outputs, and the images decoded from it carry statistical properties that differ measurably from those of authentic photographs.

GAN architecture and the adversarial training loop

Ian Goodfellow and colleagues introduced the Generative Adversarial Network framework in 2014. The core mechanism pairs a generator network with a discriminator network: the generator produces synthetic images; the discriminator classifies inputs as real or fake. Both networks update from the discriminator's error signal in an iterative loop. At convergence the generator has learned to sample from a distribution that closely resembles the training data.
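The two competing objectives can be made concrete with a minimal sketch of the loss terms. It assumes the discriminator outputs a probability that its input is real, and it uses the non-saturating generator loss that is common in practice. The function names are illustrative and not taken from any particular library.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: push D(real) towards 1 and D(fake) towards 0."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating form: the loss falls as D(fake) approaches 1."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return -np.mean(np.log(d_fake))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = np.full(8, 0.5)
print(discriminator_loss(undecided, undecided))  # equals 2 * ln(2)
print(generator_loss(undecided))                 # equals ln(2)
```

In a real training loop these two losses are minimised in alternation, each with respect to its own network's parameters.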

Figure: GAN adversarial training loop.

For face generation specifically, architectures such as ProGAN and StyleGAN2 (Karras et al., 2020) extended the basic framework with progressive training and style-based control of facial features. These models can produce 1024x1024 portraits that an untrained observer can rarely tell apart from photographs. Forensically, GAN outputs tend to leave characteristic artefacts in the high-frequency spectrum: a checkerboard pattern introduced where upsampling layers use transposed convolutions, and a distribution of pixel intensities that differs from the noise structure of a real camera sensor.
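The checkerboard effect can be reproduced with a toy experiment. The sketch below is not a model of any real generator: it applies a naive stride-2 transposed convolution with a fixed 3x3 kernel to a random stand-in feature map, then measures how much spectral energy lands on the period-2 bins of the 2-D Fourier transform. Nearest-neighbour upsampling is included for contrast. The helper names and the energy-ratio metric are assumptions chosen for illustration.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Naive transposed convolution: each input pixel stamps a scaled kernel."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros((stride * (h - 1) + k, stride * (w - 1) + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

def nyquist_energy_ratio(img):
    """Share of non-DC spectral energy at the period-2 (checkerboard) bins."""
    img = img - img.mean()
    power = np.abs(np.fft.fft2(img)) ** 2
    h, w = img.shape  # both dimensions must be even
    peaks = power[h // 2, 0] + power[0, w // 2] + power[h // 2, w // 2]
    return peaks / power.sum()

rng = np.random.default_rng(0)
features = rng.uniform(0.5, 1.5, size=(32, 32))  # stand-in feature map

# Kernel size 3 is not divisible by stride 2, so the stamps overlap unevenly.
deconv = transposed_conv2d(features, np.ones((3, 3)))[2:-1, 2:-1]  # even-sized interior
nearest = np.kron(features, np.ones((2, 2)))  # nearest-neighbour upsampling

ratio_deconv = nyquist_energy_ratio(deconv)
ratio_nearest = nyquist_energy_ratio(nearest)
print(ratio_deconv, ratio_nearest)
```

In this toy setting the transposed-convolution output concentrates most of its energy on the checkerboard bins, while the nearest-neighbour output leaves them essentially empty. Trained generators use learned kernels and non-linearities, so the real signature is fainter and model-specific.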

Face-swap encoder-decoder pipelines

The original "deepfakes" method, later formalised in tools such as DeepFaceLab and FaceSwap, uses a shared encoder with two identity-specific decoders. Both decoders are trained to reconstruct their respective target's face from a shared compressed representation of facial geometry and expression. At inference, the substitution works as follows: frames of person A are encoded, decoded with B's decoder, and blended back into the source video. Person A's head movements and expressions appear on person B's face.
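The data flow of the shared-encoder design can be sketched in a few lines. The weights below are random and untrained, the layers are single matrices rather than the convolutional networks real tools use, and every name and dimension is hypothetical. The point is only to show which decoder is used on the training path and which on the swap path.

```python
import numpy as np

rng = np.random.default_rng(0)
PIXELS, LATENT = 64 * 64, 128  # hypothetical sizes for a flattened face crop

# Untrained random weights stand in for learned parameters.
W_enc = rng.normal(0.0, 0.01, (LATENT, PIXELS))    # shared by both identities
W_dec_a = rng.normal(0.0, 0.01, (PIXELS, LATENT))  # reconstructs person A
W_dec_b = rng.normal(0.0, 0.01, (PIXELS, LATENT))  # reconstructs person B

def encode(face):
    """Compress a face crop into the shared latent representation."""
    return np.tanh(W_enc @ face)

def decode(latent, w_dec):
    """Render a face from the latent code with an identity-specific decoder."""
    return w_dec @ latent

def reconstruct_a(face_a):
    """Training path for identity A: encode A, decode with A's decoder."""
    return decode(encode(face_a), W_dec_a)

def swap_a_to_b(face_a):
    """Inference path: A's pose and expression, rendered by B's decoder."""
    return decode(encode(face_a), W_dec_b)

frame = rng.random(PIXELS)  # stand-in for an aligned face crop from A's video
swapped = swap_a_to_b(frame)
print(swapped.shape)
```

Because both decoders read the same latent code, whatever the encoder captures about pose and expression is available to either identity's decoder.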

Data collection and alignment. Hundreds to thousands of photographs of both identities are collected, then aligned to a common facial landmark template. Quality and volume of training data are the main determinants of final fidelity.
Encoder-decoder training. A single encoder and two identity-specific decoders are trained jointly to compress and reconstruct each person's face. The shared encoder forces both decoders to work in the same latent geometry.
Face extraction and warping. At inference, each frame of the source video is passed through the encoder, then through the target's decoder. The result is warped to align with the source face's exact position and pose.
Blending and post-processing. The synthetic face is composited back into the source frame using mask-based blending. Poor blending leaves a visible seam at the face boundary, which is a classic forensic marker for this generation family.
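The blending step can be sketched as alpha compositing with a feathered mask. This is a minimal greyscale version that assumes a box blur for feathering; production tools offer several mask and colour-transfer options, and the function names here are illustrative.

```python
import numpy as np

def feather_mask(mask, radius=3):
    """Soften a binary mask with a separable box blur (radius in pixels)."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)

    def smooth(v):
        return np.convolve(v, kernel, mode="same")

    soft = np.apply_along_axis(smooth, 1, mask.astype(float))
    return np.apply_along_axis(smooth, 0, soft)

def blend(source_frame, synthetic_face, mask, radius=3):
    """Composite the synthetic face into the frame using the softened mask."""
    alpha = feather_mask(mask, radius)
    return alpha * synthetic_face + (1.0 - alpha) * source_frame

frame = np.zeros((64, 64))            # stand-in source frame
face = np.ones((64, 64))              # stand-in synthetic face
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True             # face region

composite = blend(frame, face, mask)
print(composite[32, 32], composite[16, 32], composite[0, 0])
```

A small feathering radius gives a hard transition that is easy to see. A large one hides the seam but spreads the synthetic face's statistics into the surrounding frame, which is itself measurable.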

Forensically, face-swap videos carry a split provenance. The background, lighting, camera motion blur, and compression artefacts all derive from the original source video. Only the face region is synthesised. This creates detectable inconsistencies: the face region may have different noise characteristics, different compression block patterns, or shading inconsistent with the illumination of the surrounding scene. These are the signals that spatial and compression-domain detectors exploit.
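One simple way to probe the noise mismatch is to compare high-pass residuals inside and outside the face region. The sketch below uses a 3x3 local-mean filter as the denoiser and a synthetic test frame. Both are assumptions for illustration, and the face mask is taken as given.

```python
import numpy as np

def noise_residual(img):
    """High-pass residual: the image minus its 3x3 local mean."""
    padded = np.pad(img, 1, mode="reflect")
    h, w = img.shape
    local_mean = sum(
        padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)
    ) / 9.0
    return img - local_mean

def residual_std_ratio(frame, face_mask):
    """Residual noise level in the face region relative to the background."""
    res = noise_residual(frame.astype(float))
    return res[face_mask].std() / res[~face_mask].std()

# Synthetic test frame: noisy background with a smoother patch as the "face".
rng = np.random.default_rng(1)
frame = 128.0 + rng.normal(0.0, 5.0, size=(96, 96))
face_mask = np.zeros((96, 96), dtype=bool)
face_mask[32:64, 32:64] = True
frame[face_mask] = 128.0 + rng.normal(0.0, 1.0, size=face_mask.sum())

ratio = residual_std_ratio(frame, face_mask)
print(ratio)
```

A ratio far from 1 is a cue rather than proof, because scene texture, focus, and make-up also change the residual. Real analyses compare against regions of similar content.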

Figure: Face-swap encoder-decoder with a shared encoder and two decoders.

Neural talking heads and NeRF-based synthesis

Face-swap pipelines require video of the source identity. Talking-head systems do not. Given a single photograph or a short video clip, these models build a neural representation of the subject's head and animate it with arbitrary head pose and lip movements driven by an audio track. AD-NeRF does this with a volumetric radiance field fitted to a short video of the subject, while SadTalker (Zhang et al., 2023) animates a single portrait through learned 3-D motion coefficients. In both cases the output shows a real person appearing to say words they never said.

The forensic challenge with talking-head systems is that they often produce fewer blending seams than classic face swaps, because the entire face is rendered from the volumetric model rather than transplanted from a source video. Artefacts instead appear in areas of fine detail: teeth, inner mouth structures, and the boundary between the head and the background. The temporal consistency of natural head motion is also difficult to fully replicate, and subtle jitter or motion-vector irregularities can be detected in compressed video bitstreams.
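Temporal jitter can be quantified in many ways. One minimal sketch, assuming a facial landmark has already been tracked across frames by some external tool, is the RMS of the second temporal difference of its position. The function name and the synthetic tracks are illustrative.

```python
import numpy as np

def jitter_score(track):
    """RMS of the second temporal difference of a (frames, 2) landmark track."""
    accel = np.diff(track, n=2, axis=0)
    return float(np.sqrt(np.mean(np.sum(accel ** 2, axis=1))))

# Synthetic example: a slow head sway at 25 fps, with and without added jitter.
rng = np.random.default_rng(0)
t = np.arange(100) / 25.0
smooth = np.column_stack([
    100.0 + 10.0 * np.sin(2 * np.pi * 0.5 * t),
    80.0 + 5.0 * np.cos(2 * np.pi * 0.5 * t),
])
jittery = smooth + rng.normal(0.0, 0.5, size=smooth.shape)

print(jitter_score(smooth), jitter_score(jittery))
```

The score only has meaning relative to a baseline, such as authentic footage of the same subject shot at the same frame rate and resolution.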

Diffusion model synthesis

Diffusion models, which matured into practical image generators in the 2020-2022 research wave, operate on a different principle from GANs. Training teaches the model to reverse the effect of adding Gaussian noise to an image step by step. At generation time it starts from pure noise and repeatedly applies the learned denoising function, guided by a text or image prompt, until a coherent image emerges. Stable Diffusion (Rombach et al., 2022) runs this process in a compressed latent space rather than pixel space, making high-resolution generation feasible on consumer hardware.
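The noise-addition process has a closed form that makes the idea easy to check numerically. The sketch below assumes a linear noise schedule of the kind used in early denoising diffusion work. It replaces the trained network with the true noise, so it shows only the arithmetic the network is trained to invert. The function names are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (an assumption)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance left at step t

def add_noise(x0, t, eps):
    """Closed-form forward process: jump straight to noise level t."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def estimate_x0(xt, t, eps_pred):
    """Invert the forward step given a prediction of the added noise."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))            # stand-in image (or latent)
eps = rng.normal(size=x0.shape)    # the Gaussian noise that gets mixed in
t = 500

xt = add_noise(x0, t, eps)
x0_hat = estimate_x0(xt, t, eps)   # a trained model would supply its own noise estimate

print(np.allclose(x0_hat, x0), alpha_bar[-1])
```

By the final step almost none of the original signal remains, which is why generation can start from pure noise.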

Property	GAN output	Diffusion model output
Generation speed	Fast: single forward pass	Slower: many denoising steps
Spectral artefacts	Checkerboard from transposed convolution	High-frequency periodicity from U-Net upsampling
Blending seams	Present at face boundary in face-swap use	Absent in fully synthetic output
Out-of-distribution handling	Fails at uncommon poses/lighting	Better generalisation via text conditioning
Forensic fingerprint source	Generator weights as a device fingerprint	Diffusion schedule and U-Net architecture
Detection maturity	Well-studied, many published detectors	Fewer published detectors as of 2024

From a forensic standpoint, diffusion model outputs lack camera-native properties. A photograph carries sensor noise (photon shot noise, read noise), optical vignetting, chromatic aberration, and EXIF metadata tied to a specific device. A diffusion-generated image has none of these. Its noise is the residual of the denoising process, its spectral energy distribution is statistically distinct, and it carries no authentic EXIF history. Detector approaches based on these properties can achieve high accuracy on known generators, but generalisation to new model families remains an open problem.
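Spectral comparisons of this kind are often made on a radially averaged power spectrum. The sketch below computes one with NumPy and summarises it as a high-frequency energy share. The summary statistic, the cutoff, and the test images are assumptions for illustration, not a published detector.

```python
import numpy as np

def radial_power_spectrum(img):
    """Azimuthally averaged power spectrum of a 2-D greyscale image."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img - img.mean()))) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2).astype(int)  # integer radius per bin
    total = np.bincount(r.ravel(), weights=power.ravel())
    count = np.maximum(np.bincount(r.ravel()), 1)
    return total / count

def high_frequency_share(img, cutoff=0.5):
    """Fraction of radial-profile energy above `cutoff` of the Nyquist radius."""
    profile = radial_power_spectrum(img)[: min(img.shape) // 2]
    split = int(cutoff * len(profile))
    return profile[split:].sum() / profile.sum()

rng = np.random.default_rng(2)
noisy = rng.normal(size=(64, 64))  # flat spectrum, like strong sensor noise
smooth = sum(
    np.roll(noisy, (dy, dx), axis=(0, 1)) for dy in range(-2, 3) for dx in range(-2, 3)
) / 25.0                           # 5x5 box blur removes fine detail

print(high_frequency_share(noisy), high_frequency_share(smooth))
```

Which direction a generated image deviates in depends on the generator and on any post-processing, so the profile should be compared against reference photographs from a comparable camera and compression chain.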

Voice cloning: TTS and voice conversion

Audio deepfakes follow two parallel tracks. Text-to-speech systems such as SV2TTS (Jia et al., 2018) and VALL-E (Wang et al., 2023) synthesise speech in a target speaker's voice directly from text, given a short reference recording. Voice conversion systems transform the timbre of one speaker's voice to match another's while preserving linguistic content. Both can produce credible output from as little as three to ten seconds of reference audio. Both have been used in documented fraud: the 2019 UK energy-firm CEO impersonation case is the first widely cited instance, where attackers cloned a voice to authorise a €220,000 transfer. The main forensic indicators are:

Spectral artefacts: synthetic speech shows different mel-frequency cepstral coefficient (MFCC) distributions from natural speech, particularly in sibilant consonants and at phrase boundaries.
Prosodic flatness: current systems struggle to replicate the micro-variation in pitch, rate, and energy that characterises spontaneous natural speech.
Codec artefacts: voice clone audio transmitted over phone or messaging platforms typically passes through two rounds of codec encoding, one when the synthesised audio is exported and one during transmission, which leaves a detectable double-compression signature.
Background inconsistency: synthesised audio often lacks the characteristic room acoustics and background noise floor of genuine recordings, or that floor sounds artificially uniform.
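Prosodic flatness, one of the indicators above, can be summarised with a single number once a pitch contour is available. The sketch below assumes the fundamental-frequency track comes from an external pitch tracker and that unvoiced frames are marked with 0. The function name and the synthetic contours are illustrative.

```python
import numpy as np

def pitch_variability_semitones(f0_hz):
    """Standard deviation of voiced-frame pitch, in semitones around the median."""
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0[f0 > 0]  # convention assumed here: 0 marks unvoiced frames
    semitones = 12.0 * np.log2(voiced / np.median(voiced))
    return float(semitones.std())

# Synthetic contours around 120 Hz: one expressive, one nearly monotone.
t = np.linspace(0.0, 3.0, 300)
expressive = 120.0 * (1.0 + 0.2 * np.sin(2 * np.pi * 0.7 * t))
flat = 120.0 * (1.0 + 0.01 * np.sin(2 * np.pi * 0.7 * t))
expressive[::10] = 0.0  # sprinkle in unvoiced frames

print(pitch_variability_semitones(expressive), pitch_variability_semitones(flat))
```

Any decision threshold has to be calibrated per speaker and speaking style, since read speech is naturally flatter than conversation.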

Forensic audio analysis of suspected voice clones draws on the same toolkit as traditional speaker identification, but the discriminative features shift. The question is no longer whether two voices come from the same biological speaker; it is whether the challenged recording bears the statistical marks of a synthesis pipeline. Anti-spoofing research, catalogued in the ASVspoof benchmark series run since 2015, provides standardised test conditions and published equal-error-rate benchmarks for detection systems.
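The equal error rate is the operating point at which the false-acceptance and false-rejection rates coincide. A minimal sketch, assuming a detector whose higher scores mean "more likely bona fide" and using a simple threshold sweep rather than any official scoring tool:

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """EER for a detector where higher scores mean 'more likely bona fide'."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection: bona fide below the threshold.
    frr = np.array([(bonafide_scores < th).mean() for th in thresholds])
    # False acceptance: spoof at or above the threshold.
    far = np.array([(spoof_scores >= th).mean() for th in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)

# Synthetic scores: two unit-variance Gaussians two standard deviations apart.
rng = np.random.default_rng(0)
bonafide = rng.normal(1.0, 1.0, 2000)
spoof = rng.normal(-1.0, 1.0, 2000)

eer = equal_error_rate(bonafide, spoof)
print(eer)
```

For these synthetic distributions the theoretical value is about 0.159, so the printed figure should land close to that.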

Why generation artefacts are the forensic entry point

Detection methods are downstream of generation methods. A system trained on GAN artefacts will not reliably flag diffusion model output, because the artefacts are architecturally specific. This is the detection-generalisation problem, the central challenge in deepfake forensics: new generators are released regularly with architectural changes that invalidate earlier detectors, and ground-truth-labelled training data for the newest systems lags by months or years.

A forensic analyst who understands the generation pipeline can reason about which artefacts to expect even without a trained detector for the specific system. A face-swap pipeline warrants inspection for blending inconsistencies at the face boundary and compression-domain discrepancies between face and background. A diffusion model output warrants inspection for the absence of camera-native noise and for U-Net spectral periodicity. Generation-informed reasoning of this kind is a more reliable first-pass approach than any single binary classifier.

Worked example

Identifying the generation method from a politician video

Working from the artefacts back to the pipeline.

A three-minute video of a senior official appears online showing them announcing a policy position they have publicly denied. The submitting investigator needs to determine not only whether the video is synthetic but which generation method was used, because that affects both the source-tracing strategy and the court presentation.

Compression and metadata inspection. The file has no authentic camera EXIF chain. It was re-encoded to H.264 with default ffmpeg settings, a common post-processing step in deepfake distribution.
Spatial consistency check. The face region shows a subtly different JPEG block structure from the surrounding frame, with less high-frequency detail in the skin texture. This pattern is consistent with a face-swap pipeline where the face was composited into a pre-existing video.
Illumination analysis. Specular highlights on the nose and forehead are directionally inconsistent with the ambient lighting in the background. The shadow under the chin falls in a direction incompatible with the room's window.
Frequency domain analysis. A 2-D Fourier transform of face-region patches shows faint checkerboard periodicity at the spatial frequency corresponding to the GAN upsampling stride. This argues against diffusion-only generation.
Audio synchrony. Lip landmark movements extracted from the video correlate poorly with the audio waveform at 200-400 ms intervals. This temporal slip is typical of talking-head driving signals that were not derived from the original video's audio.
Conclusion. The visual evidence points to a face-swap or talking-head GAN pipeline applied to genuine source video. The GAN spectral fingerprint, the blending inconsistency, and the audio-lip desynchrony together constitute a convergent case, though no single signal is decisive on its own.
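The audio synchrony step can be sketched as a lag search over a normalised cross-correlation. The sketch assumes two signals already sampled at the video frame rate: a mouth-opening measure from lip landmarks and an audio energy envelope. Extracting those signals is outside its scope, and the function and parameter names are illustrative.

```python
import numpy as np

def estimate_lag_ms(mouth_opening, audio_envelope, fps=25.0, max_lag_frames=15):
    """Lag in ms at which the audio envelope best matches mouth opening.

    A positive value means the audio trails the video.
    """
    m = (mouth_opening - mouth_opening.mean()) / mouth_opening.std()
    a = (audio_envelope - audio_envelope.mean()) / audio_envelope.std()
    n = len(m)
    lags = np.arange(-max_lag_frames, max_lag_frames + 1)
    corr = []
    for lag in lags:
        if lag >= 0:
            corr.append(np.mean(m[:n - lag] * a[lag:]))
        else:
            corr.append(np.mean(m[-lag:] * a[:n + lag]))
    best = lags[int(np.argmax(corr))]
    return best * 1000.0 / fps

# Synthetic check: delay a smoothed random signal by 8 frames (320 ms at 25 fps).
rng = np.random.default_rng(3)
mouth = np.convolve(rng.normal(size=300), np.ones(5) / 5, mode="same")
audio = np.roll(mouth, 8)

lag_ms = estimate_lag_ms(mouth, audio)
print(lag_ms)
```

A single global lag is the simplest case. The drifting offsets described in the example would call for the same search over short sliding windows.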

The worked example shows why generation-method knowledge is operationally useful. The analyst did not need a specific trained detector: they reasoned from each generation family's known artefact profile to narrow the hypothesis space, then confirmed with measurable signals. This approach remains valid even against novel generator versions, as long as the architectural family is recognisable.

Key Takeaways

GANs use adversarial training between a generator and a discriminator; transposed-convolution upsampling produces checkerboard artefacts measurable in the Fourier domain.
Face-swap encoder-decoder pipelines (DeepFaceLab, FaceSwap) composite a synthesised face onto genuine source video, creating detectable noise and compression-domain mismatches at the face boundary.
NeRF-based talking-head systems animate a static photograph, producing inconsistencies in teeth rendering, inner-mouth structure, and temporal motion vectors rather than blending seams.
Diffusion models (Stable Diffusion, DALL-E) generate images lacking camera-native noise and authentic EXIF, with spectral signatures from U-Net upsampling distinct from genuine photographs.
Voice cloning systems can impersonate a speaker from seconds of reference audio; forensic signals include MFCC distribution differences, prosodic flatness, and double-compression codec artefacts.
Detectors are generation-method specific; against novel generators, understanding the pipeline is a more reliable first-pass approach than any single binary classifier.

What is the difference between a GAN and a diffusion model for generating faces?

A GAN uses two competing networks, a generator and a discriminator, trained simultaneously until the generator produces images the discriminator cannot reliably flag as fake. A diffusion model instead learns to reverse a gradual noise-addition process, reconstructing a clean image from random noise step by step. Diffusion models tend to produce fewer obvious artefacts and higher fidelity, but both leave detectable statistical fingerprints.

How does a face-swap pipeline differ from a full face synthesis pipeline?

A face-swap pipeline replaces one person's face in an existing video: it extracts the face in each frame, renders the other identity's appearance with the same pose and expression, and blends the result back into the original frames. Full-synthesis systems such as Stable Diffusion generate the entire image from scratch. Forensically, face-swap outputs still carry the original video's lighting and compression history, while fully generated images do not.

Can voice cloning produce a convincing fake in real time?

Modern voice conversion systems can run at near-real-time latency on consumer hardware. VALL-E and similar transformer-based models can reproduce a speaker's timbre from as little as three seconds of reference audio. The quality is high enough to deceive casual listeners, but spectral and prosodic analysis can still detect artefacts in most current systems.

What latent-space signatures do diffusion models leave in generated images?

Diffusion model outputs carry spectral energy distributions that differ from camera-captured photographs, particularly in the high-frequency regime. They also often show slight periodicity patterns introduced by the U-Net denoising architecture and upsampling steps. These patterns are subtle but measurable with Fourier analysis or trained residual detectors.
