Automated facial recognition uses deep neural networks to generate similarity scores between facial images; understanding how these systems work, where they fail, and how their outputs should be used is essential for any forensic practitioner or court that encounters them.
Facial recognition technology is now embedded in phones, border checkpoints, retail loss-prevention systems, and police intelligence platforms. Somewhere between a camera capturing a face and a result appearing on a screen, a deep neural network has compared that face against a database of millions. The system operates at a speed and scale no human examiner can match. But speed and scale are not the same thing as accuracy, and accuracy for one demographic group is not accuracy for another. Forensic practitioners who work with images, and the courts that receive their evidence, need to understand what automated facial recognition actually does before deciding how much weight to give it.
The best-known deep-learning architectures for facial recognition, including ArcFace and FaceNet, work by converting a face image into a compact numerical representation (an embedding) and then measuring the geometric distance between embeddings. If two embeddings are close enough, the system reports a probable match. The threshold for what counts as close enough is a policy decision as much as a technical one, and moving it shifts the trade-off between false positives and false negatives. No threshold eliminates both.
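The threshold trade-off described above can be made concrete with a small sketch. The distances below are invented for illustration, and `error_rates` is a hypothetical helper, not part of any particular AFR product; the point is only that moving the threshold lowers one error rate while raising the other.

```python
import numpy as np

def error_rates(genuine, impostor, threshold):
    """Return (false_match_rate, false_non_match_rate) for a distance threshold.

    A pair is declared a match when its embedding distance is <= threshold.
    """
    genuine = np.asarray(genuine, dtype=float)    # same-person pairs
    impostor = np.asarray(impostor, dtype=float)  # different-person pairs
    fmr = float(np.mean(impostor <= threshold))   # different people accepted
    fnmr = float(np.mean(genuine > threshold))    # same person rejected
    return fmr, fnmr

# Made-up embedding distances, for illustration only.
genuine_distances = [0.20, 0.35, 0.40, 0.55, 0.70]
impostor_distances = [0.50, 0.65, 0.80, 0.90, 1.10]

for t in (0.45, 0.60, 0.75):
    fmr, fnmr = error_rates(genuine_distances, impostor_distances, t)
    print(f"threshold={t:.2f}  FMR={fmr:.2f}  FNMR={fnmr:.2f}")
```

On this toy data a strict threshold produces no false matches but rejects two of the five genuine pairs, while a loose threshold does the reverse. Because the two score distributions overlap, no setting drives both rates to zero.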
This topic covers three things: how the core technology works at a level that a forensic practitioner needs to understand; what NIST's extensive benchmark testing reveals about where current systems succeed and where they fail; and how the legal and regulatory frameworks in the UK, EU, and elsewhere are beginning to catch up with the technology. The tension between the investigative utility of AFR and its risks as courtroom evidence is a live question, and the answers being built right now will shape criminal justice for the next decade.
Under every recognition decision is a number. Here is where that number comes from.
A deep convolutional neural network for face recognition is trained on millions of face images labelled by identity. Through training, the network learns to map any face image to a fixed-length embedding vector, such that images of the same person land close together in the embedding space and images of different people land far apart. Closeness is typically measured by cosine similarity (higher means more alike) or Euclidean distance (lower means more alike).
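As a minimal sketch of the comparison step, the two measures can be computed as follows. The 128-dimensional vectors here are random stand-ins, not the output of a real face network, and the function names are illustrative.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (1.0 = same direction)."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))

def euclidean_distance(a, b):
    """Straight-line distance between the unit-length embeddings."""
    return float(np.linalg.norm(l2_normalize(a) - l2_normalize(b)))

# Random stand-ins for embeddings; a real system would get these from the network.
rng = np.random.default_rng(0)
probe = rng.normal(size=128)
same_person = probe + 0.3 * rng.normal(size=128)   # small perturbation of the probe
other_person = rng.normal(size=128)                # unrelated vector

print("same:  cos =", cosine_similarity(probe, same_person),
      " dist =", euclidean_distance(probe, same_person))
print("other: cos =", cosine_similarity(probe, other_person),
      " dist =", euclidean_distance(probe, other_person))
```

For unit-length embeddings the two measures carry the same information, since the squared Euclidean distance equals 2 − 2 × cosine similarity. Ranking candidates by either gives the same order.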
FaceNet, published by Florian Schroff and colleagues at Google in 2015, was influential in showing that a single network trained end-to-end on raw pixels could achieve high accuracy with 128-dimensional embeddings. ArcFace, published by Jiankang Deng and colleagues in 2019, improved on the training loss function by introducing an additive angular margin that forces the network to push embeddings for different identities further apart during training. Both architectures share the same operational logic: produce an embedding, compare embeddings, return a score.
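The additive angular margin can be illustrated with a simplified sketch of how the logit for the true identity is penalised during training. This is not the authors' implementation: the function names are hypothetical, the three-dimensional vectors are toy values, and the handling real implementations add for angles that would exceed π is omitted. The scale of 64 and margin of 0.5 radians are the values commonly quoted from the ArcFace paper.

```python
import numpy as np

def arcface_logits(embedding, class_weights, true_class, scale=64.0, margin=0.5):
    """Scaled cosine logits with an angular margin added for the true class."""
    x = embedding / np.linalg.norm(embedding)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos_theta = np.clip(w @ x, -1.0, 1.0)      # cosine to each identity centre
    theta = np.arccos(cos_theta)
    logits = cos_theta.copy()
    # Widen the angle to the true identity by `margin` before taking the cosine.
    logits[true_class] = np.cos(theta[true_class] + margin)
    return scale * logits

def softmax_cross_entropy(logits, true_class):
    """Standard cross-entropy loss on a vector of logits."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[true_class])

# Toy setup: three identity centres and one training embedding for identity 0.
weights = np.eye(3)
embedding = np.array([1.0, 0.8, 0.0])

plain = arcface_logits(embedding, weights, true_class=0, margin=0.0)
with_margin = arcface_logits(embedding, weights, true_class=0, margin=0.5)
print("loss without margin:", softmax_cross_entropy(plain, 0))
print("loss with margin:   ", softmax_cross_entropy(with_margin, 0))
```

In this toy case the embedding sits closer to its own identity centre than to any other, so the plain loss is already near zero. With the margin applied the same embedding is penalised heavily, which is the pressure that forces the network to separate identities by more than the bare minimum.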
At deployment, a forensic unit might use a system that returns the top-N gallery candidates ranked by similarity score, with a threshold applied to filter out low-confidence candidates. The investigator sees a shortlist of possible matches. What the investigator does not see is the underlying embedding arithmetic or the training data that shaped it. This opacity is one reason why a trained human examiner reviewing the candidate images is considered essential before any result is presented as evidence.
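A ranked shortlist of this kind might be produced roughly as follows. This is a sketch under stated assumptions: `top_candidates`, the gallery identifiers, the threshold of 0.4, and the three-dimensional vectors are all invented for illustration and do not describe any specific deployed system.

```python
import numpy as np

def top_candidates(probe, gallery, ids, n=5, threshold=0.4):
    """Return up to n (id, score) pairs, best first, scoring at or above threshold."""
    probe = np.asarray(probe, dtype=float)
    gallery = np.asarray(gallery, dtype=float)
    probe = probe / np.linalg.norm(probe)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ probe                       # cosine similarity per gallery entry
    order = np.argsort(scores)[::-1][:n]           # indices of the n highest scores
    return [(ids[i], float(scores[i])) for i in order if scores[i] >= threshold]

# Hypothetical gallery of four enrolled embeddings (toy 3-D vectors).
gallery_ids = ["gallery-001", "gallery-002", "gallery-003", "gallery-004"]
gallery_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
probe_vector = [1.0, 0.05, 0.0]

for candidate_id, score in top_candidates(probe_vector, gallery_vectors, gallery_ids, n=3):
    print(candidate_id, round(score, 3))
```

Note that the function always returns whoever scores highest above the threshold, whether or not the person in the probe image is in the gallery at all. The list is a set of candidates for human review, not a set of identifications.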
The most comprehensive public dataset on what facial recognition can and cannot do.
The National Institute of Standards and Technology (NIST) has run the Face Recognition Vendor Test (FRVT) programme since 2000, regularly publishing detailed reports on the accuracy of commercial and research algorithms tested against standardised datasets. The FRVT is the most comprehensive independent benchmark in the world, and its findings are the primary evidence base for understanding where current systems stand.
The headline finding from NIST FRVT 1:1 Verification testing (which measures whether two images are of the same person) is that the best current algorithms achieve false match rates below one in a million at false non-match rates below 1% on cooperative, controlled images. This is impressive performance. The caveat that matters for casework is that forensic images are almost never cooperative or controlled. They are CCTV footage of variable quality, with partial occlusion, non-frontal pose, and subjects who are not looking at the camera. Accuracy under these harder conditions degrades substantially.
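A per-comparison false match rate also needs to be read against the size of the database being searched. The arithmetic below is a back-of-envelope sketch that assumes every comparison is independent and has the same false match rate, which real 1:N systems only approximate; the function names and the gallery size of one million are illustrative.

```python
def expected_false_matches(fmr, gallery_size):
    """Expected number of false matches when one probe is compared to every entry."""
    return fmr * gallery_size

def prob_at_least_one_false_match(fmr, gallery_size):
    """Chance of one or more false matches, assuming independent comparisons."""
    return 1.0 - (1.0 - fmr) ** gallery_size

fmr = 1e-6            # one in a million, per comparison
gallery = 1_000_000   # illustrative gallery that does not contain the probe subject

print("expected false matches:", expected_false_matches(fmr, gallery))
print("P(at least one):", round(prob_at_least_one_false_match(fmr, gallery), 3))
```

Under these assumptions, a one-in-a-million false match rate applied across a million-record gallery yields about one expected false match per search, with roughly a 63% chance of at least one. A rate that sounds negligible for a single comparison is not negligible at search scale.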
The FRVT demographic differentials report (2019 and subsequent updates, authored by Patrick Grother and colleagues) found that most algorithms show measurable differences in error rates across sex, age, and ethnicity. The largest effects were in false positive rates (wrong identifications), which were higher for women than for men, higher for the oldest and youngest subjects, and, for many algorithms, higher for West African, East African, and East Asian faces than for Eastern European faces. False non-match rates also varied by group, though generally by smaller margins and less consistently across algorithms and image types. These are not small differences. For some algorithms the false positive rate was 10 to 100 times higher for certain demographic groups than for the best-performing group.
| Accuracy dimension | Best-performing group (typically) | Most affected group (typically) | Implication for casework |
|---|---|---|---|
| False non-match rate | Young adult males, controlled images | Older adults, some ethnic groups, poor-quality probe images | Higher risk of missing a genuine match for these groups |
| False match rate | Varies by algorithm and demographic interaction | Some ethnic minority groups under certain algorithms | Higher risk of a false identification for these groups |
| Overall 1:N search accuracy | Controlled gallery and probe images | Surveillance-quality probe images | Significant accuracy drop from lab benchmarks to operational conditions |
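A demographic differential of the kind summarised above is, at its simplest, a ratio of error rates between groups at a fixed threshold. The sketch below uses invented similarity scores and anonymous group labels; it does not reproduce NIST's methodology or data, and the helper names are hypothetical.

```python
import numpy as np

def fmr_by_group(impostor_scores, groups, threshold):
    """False match rate per group: share of different-person pairs scoring >= threshold."""
    scores = np.asarray(impostor_scores, dtype=float)
    groups = np.asarray(groups)
    return {str(g): float(np.mean(scores[groups == g] >= threshold))
            for g in np.unique(groups)}

def worst_to_best_ratio(rates):
    """Ratio of the highest group rate to the lowest; undefined if the lowest is zero."""
    lowest, highest = min(rates.values()), max(rates.values())
    if lowest == 0:
        raise ValueError("lowest group rate is zero; ratio is undefined")
    return highest / lowest

# Made-up similarity scores for different-person pairs in two groups.
scores_a = [0.10, 0.15, 0.20, 0.22, 0.25, 0.30, 0.31, 0.33, 0.35, 0.62]
scores_b = [0.20, 0.28, 0.35, 0.40, 0.45, 0.55, 0.61, 0.64, 0.70, 0.72]
all_scores = scores_a + scores_b
labels = ["A"] * len(scores_a) + ["B"] * len(scores_b)

rates = fmr_by_group(all_scores, labels, threshold=0.6)
print(rates)
print("worst-to-best ratio:", worst_to_best_ratio(rates))
```

The same threshold gives the two toy groups different false match rates because their impostor score distributions differ. That is why a single published accuracy figure for an algorithm says little about the risk of a false identification for a particular person in a particular case.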
Using AFR to find a lead is not the same as presenting AFR output in court.
The distinction between investigative and evidential use of automated facial recognition is the central policy question in this area, and it is one that courts, police forces, and regulators are actively working through. The distinction is clean in principle and messy in practice.
In investigative use, AFR acts as an intelligence tool. A probe image from a crime scene is run against a custody or passport database, and the system returns a ranked list of candidate identities. A trained detective or examiner then reviews the candidates, applies investigative judgment, and decides whether to pursue any of them. The AFR output is a starting point for human inquiry, not a conclusion. If the algorithm was wrong, the human review is supposed to catch it. This is analogous to using a fingerprint database to generate AFIS candidates, which are then verified by a fingerprint examiner.
In evidential use, a report based on the facial comparison is placed before a court to support identification. The critical question is: what is the report actually saying? If the examiner has independently reviewed the images and formed an opinion based on the morphological comparison, the AFR system is simply how the examiner came to look at these particular two images, and the evidence is the examiner's opinion. If the report instead presents the AFR similarity score as the evidence of identification, that is a different and more problematic claim, because the court has no ability to interrogate the algorithm's reasoning.
Knowing a system's limitations is not optional when your opinion might send someone to prison.
The demographic accuracy disparities documented in NIST FRVT create an expert-witness obligation that is worth spelling out plainly. If an expert uses an AFR system to generate a candidate match and then presents evidence that the images depict the same person, that expert must be aware of, and must disclose to the court, the accuracy characteristics of the algorithm they used. An expert who presents AFR-assisted identification evidence without knowing or disclosing the relevant false match rates and demographic differentials is not fulfilling the duty of candour that expert witnesses owe to the court.
This is not a hypothetical concern. In the United States, several high-profile wrongful arrests have been linked to AFR false matches, notably those of Robert Williams (2020, Michigan), Michael Oliver (2019, Michigan), and Nijeer Parks (2019, New Jersey). In each case a Black man was arrested primarily on the strength of an AFR match that the investigating officers did not adequately verify against the actual images. None of these cases reached court through qualified expert facial comparison; they reached arrest through AFR-assisted investigation that failed at the human-review step.
Europe is building the first comprehensive legal framework for facial recognition in public spaces.
The EU Artificial Intelligence Act, which entered into force in August 2024, is the most detailed legal framework yet developed for regulating automated facial recognition. It classifies AI systems by risk and applies the strictest requirements to the highest-risk categories. For facial recognition specifically, the Act distinguishes between real-time remote biometric identification (live AFR in public spaces) and post-event identification (AFR applied to recorded footage after an incident).
Real-time remote biometric identification by law enforcement in public spaces is, as a general rule, prohibited under the Act, with narrow exceptions: searching for victims of certain crimes (abduction, trafficking, sexual exploitation) and for missing persons, responding to specific and imminent threats to life or of a terrorist attack, and locating or identifying suspects of listed serious criminal offences punishable by a custodial sentence of at least four years. Each exception requires prior judicial or independent administrative authorisation and is geographically and temporally bounded. This represents the most restrictive approach any major jurisdiction has taken toward live AFR.
Post-event AFR used in criminal investigations falls into the high-risk category under Annex III of the Act. High-risk systems must meet requirements for risk management, training data quality documentation, technical documentation, logging, transparency to deploying authorities, human oversight, accuracy and robustness, and cybersecurity. Deploying authorities must also register the system in the EU's AI database. These requirements, when implemented, will create a documented trail of performance characteristics and oversight decisions that could become relevant in any criminal proceeding that relied on the system.
The algorithm finds candidates. The examiner decides.
The current professional consensus, reflected in guidance from the National Institute of Justice (US), the College of Policing (UK), and the European Network of Forensic Science Institutes (ENFSI), is that AFR outputs should be treated as investigative leads requiring human expert review before any evidential conclusion is drawn. This two-stage model is not merely a procedural formality. It is the methodological safeguard that separates a case built on expert opinion from one built on algorithm output.