Automated Facial Recognition: Role, Accuracy, and Limitations

Automated facial recognition uses deep neural networks to generate similarity scores between facial images; understanding how these systems work, where they fail, and how their outputs should be used is essential for any forensic practitioner or court that encounters them.

Last updated: 19 Jun 2026

Automated facial recognition (AFR) systems convert face images into numerical embedding vectors using deep convolutional neural networks, then rank candidates in a reference database by the geometric distance between embeddings. A similarity threshold determines what is reported as a candidate match, but that threshold is a policy decision: lowering it reduces false non-matches while increasing false identifications, and raising it does the reverse. NIST's Face Recognition Vendor Testing programme has shown that top algorithms achieve false match rates below one in a million on high-quality controlled images, but error rates increase substantially on surveillance-quality footage and differ markedly across demographic groups. In forensic practice, AFR output is an investigative lead requiring independent morphological review by a qualified examiner before any identification conclusion is presented to a court.

Automated facial recognition is now deployed in border checkpoints, police intelligence platforms, and retail loss-prevention systems. A deep neural network compares a probe face against a database of millions in seconds, at a scale no human examiner can match. Speed and scale are not the same as accuracy, and accuracy for one demographic group is not accuracy for another. Forensic practitioners and the courts that receive their evidence need to understand what AFR actually does before deciding how much weight to give its output.

The best-known deep-learning architectures for facial recognition, including ArcFace and FaceNet, work by converting a face image into a compact numerical representation (an embedding) and then measuring the geometric distance between embeddings. If two embeddings are close enough, the system reports a probable match. The threshold for what counts as close enough is a policy decision as much as a technical one, and moving it shifts the trade-off between false positives and false negatives. No threshold eliminates both.

This topic covers three things: how the core technology works at a level that a forensic practitioner needs to understand; what NIST's extensive benchmark testing reveals about where current systems succeed and where they fail; and how the legal and regulatory frameworks in the UK, EU, and elsewhere are beginning to catch up with the technology. The tension between the investigative utility of AFR and its risks as courtroom evidence is unresolved, and current legal and technical standards will shape how criminal justice systems handle it for years to come.

By the end of this topic you will be able to:

Explain how deep CNN facial recognition systems produce embeddings and use similarity thresholds to generate candidate matches.
Interpret NIST FRVT benchmark data, including what accuracy figures do and do not tell you about performance on surveillance-quality casework images.
Distinguish investigative use of AFR from evidential use, and state the conditions under which AFR-assisted results may be presented in court.
Identify the demographic accuracy disparities documented in NIST FRVT reports and articulate the expert-witness disclosure obligations they create.
Summarise the EU AI Act risk classifications for real-time and post-event AFR and apply the two-stage workflow recommended by ENFSI, NIJ, and the College of Policing.

Key terms

Embedding: A compact numerical vector representation of a face produced by a neural network. The distance between two embeddings in the vector space is used as the similarity metric between two face images.
Gallery vs. probe: In facial recognition, the gallery is the reference database of enrolled face images; the probe is the query image being compared against the gallery. A system searches the gallery for images close to the probe embedding.
False Match Rate (FMR): The proportion of non-mate comparisons (different individuals) that the system incorrectly scores as a match. Also called the false positive rate or false accept rate. A lower FMR means fewer wrong identifications.
False Non-Match Rate (FNMR): The proportion of mate comparisons (same individual) that the system incorrectly scores as non-matching. Also called the false negative rate. A lower FNMR means fewer failures to recognise the same person.
NIST FRVT: The National Institute of Standards and Technology Face Recognition Vendor Testing programme, the primary independent benchmark for facial recognition algorithms, publishing ongoing accuracy reports covering dozens of commercial and academic systems.
ArcFace / FaceNet: Two influential deep CNN facial recognition architectures. FaceNet (Google, 2015) used a triplet-loss training approach to learn facial embeddings. ArcFace (Deng et al., 2019) introduced an additive angular margin penalty that improved embedding discrimination and has become a widely adopted reference architecture.
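
The FMR and FNMR definitions above can be made concrete with a small sketch. The scores are invented for illustration and do not come from any real system, and the function name error_rates is hypothetical. The sketch assumes that higher scores mean greater similarity and that a comparison is reported as a match when its score is at or above the threshold.

```python
def error_rates(mate_scores, non_mate_scores, threshold):
    """Return (FMR, FNMR) at a given similarity threshold.

    Assumes higher score = more similar, and score >= threshold = match.
    """
    # Different-person pairs wrongly scored as a match.
    false_matches = sum(1 for s in non_mate_scores if s >= threshold)
    # Same-person pairs wrongly scored as a non-match.
    false_non_matches = sum(1 for s in mate_scores if s < threshold)
    fmr = false_matches / len(non_mate_scores)
    fnmr = false_non_matches / len(mate_scores)
    return fmr, fnmr


# Invented similarity scores, for illustration only.
mate_scores = [0.91, 0.84, 0.77, 0.62, 0.58]       # same-person pairs
non_mate_scores = [0.12, 0.25, 0.33, 0.49, 0.66]   # different-person pairs

for threshold in (0.4, 0.6, 0.8):
    fmr, fnmr = error_rates(mate_scores, non_mate_scores, threshold)
    print(f"threshold={threshold:.1f}  FMR={fmr:.2f}  FNMR={fnmr:.2f}")
```

With these invented scores, raising the threshold from 0.4 to 0.8 takes FMR from 0.40 to 0.00 while FNMR rises from 0.00 to 0.60, which is the trade-off the threshold controls.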

How deep CNN facial recognition works

A deep convolutional neural network for face recognition is trained on millions of face images labelled by identity. Through training, the network learns to map any face image to a fixed-length embedding vector, such that images of the same person land close together in the embedding space and images of different people land far apart. The distance metric is typically cosine similarity or Euclidean distance in the embedding space.

FaceNet, published by Florian Schroff and colleagues at Google in 2015, was influential in showing that a single network trained end-to-end on raw pixels could achieve high accuracy with 128-dimensional embeddings. ArcFace, published by Jiankang Deng and colleagues in 2019, improved on the training loss function by introducing an additive angular margin that forces the network to push embeddings for different identities further apart during training. Both architectures share the same operational logic: produce an embedding, compare embeddings, return a score.
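
The "compare embeddings, return a score" step can be sketched with the two metrics named above. This is a minimal illustration, not the code of FaceNet, ArcFace, or any deployed system: the four-dimensional vectors are invented toy values (real embeddings have far more dimensions, such as the 128 used by FaceNet), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors; 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented toy embeddings, for illustration only.
probe = [0.9, 0.1, 0.3, 0.2]
same_person = [0.8, 0.2, 0.35, 0.15]
different_person = [0.1, 0.9, 0.2, 0.7]

for label, embedding in (("same person", same_person),
                         ("different person", different_person)):
    print(label,
          round(cosine_similarity(probe, embedding), 3),
          round(euclidean_distance(probe, embedding), 3))
```

Note that the two metrics run in opposite directions: a higher cosine similarity and a lower Euclidean distance both indicate a closer pair.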

Deep CNN facial recognition pipeline from probe to candidate matches.

At deployment, a forensic unit might use a system that returns the top-N gallery candidates ranked by similarity score, with a threshold applied to filter out low-confidence candidates. The investigator sees a shortlist of possible matches. What the investigator does not see is the underlying embedding arithmetic or the training data that shaped it. This opacity is one reason why a trained human examiner reviewing the candidate images is considered essential before any result is presented as evidence.
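
The shortlist behaviour described above can be sketched as a thresholded top-N search. This is an assumption-laden illustration rather than any vendor's implementation: the gallery entries, the embeddings, the threshold values, and the function name search_gallery are all hypothetical, and cosine similarity is used as the score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_gallery(probe, gallery, threshold, top_n):
    """Return up to top_n (identity, score) pairs at or above the
    threshold, ranked from most to least similar."""
    scored = [(identity, cosine_similarity(probe, embedding))
              for identity, embedding in gallery.items()]
    candidates = [(identity, score) for identity, score in scored
                  if score >= threshold]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]


# Invented toy gallery, for illustration only.
gallery = {
    "subject_A": [0.8, 0.2, 0.35, 0.15],
    "subject_B": [0.1, 0.9, 0.2, 0.7],
    "subject_C": [0.7, 0.4, 0.1, 0.3],
}
probe = [0.9, 0.1, 0.3, 0.2]

for identity, score in search_gallery(probe, gallery, threshold=0.8, top_n=5):
    print(identity, round(score, 3))
```

In this toy gallery, tightening the threshold from 0.8 to 0.95 shortens the shortlist from two candidates to one, which shows how the investigator's view depends on a setting they may never see.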

NIST FRVT: what the benchmarks actually show

The National Institute of Standards and Technology has run the Face Recognition Vendor Testing programme since the early 2000s, regularly publishing detailed reports on the accuracy of commercial and research algorithms tested against standardised datasets. The FRVT is the most comprehensive independent benchmark in the world, and its findings are the primary evidence base for understanding where current systems stand.

The headline finding from NIST FRVT 1:1 Verification testing (which measures whether two images are of the same person) is that the best current algorithms achieve false match rates below one in a million at false non-match rates below 1% on cooperative, controlled images. This is impressive performance. The immediately relevant caveat is that forensic casework almost never involves cooperative, controlled images. It involves CCTV footage of variable quality, partial occlusion, non-frontal pose, and subjects who are not looking at the camera. Accuracy on these harder conditions degrades substantially.
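
A one-in-a-million false match rate is a per-comparison figure, and simple arithmetic shows why it does not carry over unchanged to a database search. The sketch below assumes, as a simplification, that a search of a gallery of N images behaves like N independent 1:1 comparisons with the same FMR; real search systems do not behave exactly this way, and the gallery sizes are hypothetical.

```python
def expected_false_matches(fmr, gallery_size):
    """Expected number of false matches in one search, assuming
    gallery_size independent comparisons at the same FMR."""
    return fmr * gallery_size


def prob_at_least_one_false_match(fmr, gallery_size):
    """Probability that one search yields at least one false match,
    under the same independence assumption."""
    return 1.0 - (1.0 - fmr) ** gallery_size


fmr = 1e-6  # one in a million, per comparison

# Hypothetical gallery sizes, for illustration only.
for gallery_size in (10_000, 1_000_000, 10_000_000):
    expected = expected_false_matches(fmr, gallery_size)
    probability = prob_at_least_one_false_match(fmr, gallery_size)
    print(f"N={gallery_size:>10,}  expected={expected:.2f}  "
          f"P(at least one)={probability:.3f}")
```

Under these assumptions, a one-in-a-million FMR applied to a gallery of one million images gives an expected one false match per search and roughly a 63% chance of at least one, before any degradation from image quality is considered.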

The FRVT demographic differentials report (2019 and subsequent updates, authored by Patrick Grother and colleagues) found that most algorithms have markedly higher false non-match rates for women, for older adults, and for Black and East Asian faces relative to their performance on white male faces in the prime adult range. False positive rates (wrong identifications) are also elevated for some groups depending on the algorithm. These are not small differences. For some algorithms the false positive rate was 10 to 100 times higher for certain demographic groups than for the best-performing group.

Accuracy dimension	Best-performing group (typically)	Most affected group (typically)	Implication for casework
False non-match rate	Young adult males, controlled images	Older adults, some ethnic groups, poor-quality probe images	Higher risk of missing a genuine match for these groups
False match rate	Varies by algorithm and demographic interaction	Some ethnic minority groups under certain algorithms	Higher risk of a false identification for these groups
Overall 1:N search accuracy	Controlled gallery and probe images	Surveillance-quality probe images	Significant accuracy drop from lab benchmarks to operational conditions
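
A demographic differential of the kind summarised above is usually expressed as a ratio between group error rates. The sketch below shows that calculation only. The counts are invented, the group labels are neutral placeholders rather than real demographic categories, and the function name fmr_ratios is hypothetical; none of the numbers are NIST results.

```python
def fmr_ratios(false_matches_by_group, comparisons_by_group):
    """Return each group's false match rate as a multiple of the
    lowest group rate (the best-performing group has ratio 1.0)."""
    rates = {group: false_matches_by_group[group] / comparisons_by_group[group]
             for group in comparisons_by_group}
    baseline = min(rates.values())
    if baseline == 0:
        raise ValueError("baseline rate is zero; a ratio is undefined")
    return {group: rate / baseline for group, rate in rates.items()}


# Invented counts, for illustration only.
false_matches = {"group_1": 2, "group_2": 20, "group_3": 150}
comparisons = {"group_1": 1_000_000, "group_2": 1_000_000, "group_3": 1_000_000}

for group, ratio in fmr_ratios(false_matches, comparisons).items():
    print(f"{group}: {ratio:.0f}x the best-performing group's FMR")
```

A single overall accuracy figure for this invented system would average across the three groups and hide the spread, which is why group-level rates matter for disclosure.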

Investigative versus evidential use

The distinction between investigative and evidential use of automated facial recognition is the central policy question in this area, and it is one that courts, police forces, and regulators are still working through. The principle is clear; consistent application is not.

In investigative use, AFR acts as an intelligence tool. A probe image from a crime scene is run against a custody or passport database, and the system returns a ranked list of candidate identities. A trained detective or examiner then reviews the candidates, applies investigative judgment, and decides whether to pursue any of them. The AFR output is a starting point for human inquiry, not a conclusion. If the algorithm was wrong, the human review is supposed to catch it. This is analogous to using a fingerprint database to generate AFIS candidates, which are then verified by a fingerprint examiner.

In evidential use, a report based on the facial comparison is placed before a court to support identification. The critical question is: what is the report actually saying? If the examiner has independently reviewed the images and formed an opinion based on the morphological comparison, the AFR system is simply how the examiner came to look at these particular two images, and the evidence is the examiner's opinion. If the report instead presents the AFR similarity score as the evidence of identification, that is a different and more problematic claim, because the court has no ability to interrogate the algorithm's reasoning.

Bias, false-match rates, and the expert's obligation

The demographic accuracy disparities documented in NIST FRVT create a direct expert-witness obligation. If an expert uses an AFR system to generate a candidate match and then presents evidence that the images depict the same person, that expert must be aware of, and must disclose to the court, the accuracy characteristics of the algorithm they used. An expert who presents AFR-assisted identification evidence without knowing or disclosing the relevant false match rates and demographic differentials is not fulfilling the duty of candour that expert witnesses owe to the court.

Documented cases illustrate the consequences. In the United States several high-profile wrongful arrests have been linked to AFR false matches, notably the cases of Robert Williams (2020, Michigan), Michael Oliver (2019, Michigan), and Nijeer Parks (2019, New Jersey). In each case a Black man was arrested primarily on the strength of an AFR match that the investigating officer had not adequately verified against the actual images. None of these cases involved a facial comparison by a qualified expert; each reached arrest through an AFR-assisted investigation in which the human-review step failed.

A report that relies on an AFR-generated candidate should therefore do four things:
State the algorithm used: name the system and version in any report, as performance can differ between versions.
Disclose known demographic differentials: if NIST FRVT or other published data show elevated error rates for groups relevant to the subject of the probe image, this must be stated.
Separate algorithm output from examiner opinion: the report must make clear that the conclusion is the examiner's independent morphological assessment, not the algorithm's similarity score.
State image quality effects: benchmark accuracy does not apply to low-resolution, non-frontal, or occluded probe images. If the case images fall below the quality threshold at which the algorithm was validated, say so.

The EU AI Act and real-time biometric identification

The EU Artificial Intelligence Act, which entered into force in August 2024, is the most detailed legal framework yet developed for regulating automated facial recognition. It classifies AI systems by risk and applies the strictest requirements to the highest-risk categories. For facial recognition specifically, the Act distinguishes between real-time remote biometric identification (live AFR in public spaces) and post-event identification (AFR applied to recorded footage after an incident).

Real-time remote biometric identification by law enforcement in public spaces is, as a general rule, prohibited under the Act, with narrow exceptions: searching for specific victims of abduction, trafficking, or sexual exploitation and for missing persons; preventing a specific and imminent threat to life or a terrorist attack; or locating suspects of listed serious criminal offences punishable by a maximum custodial sentence of at least four years. Each exception requires prior judicial or independent administrative authorisation and is geographically and temporally bounded. This represents the most restrictive approach any major jurisdiction has taken toward live AFR.

Post-event AFR used in criminal investigations falls into the high-risk category under Annex III of the Act. High-risk systems must meet requirements for risk management, training data quality documentation, technical documentation, logging, transparency to deploying authorities, human oversight, accuracy and robustness, and cybersecurity. Deploying authorities must also register the system in the EU's AI database. These requirements, when implemented, will create a documented trail of performance characteristics and oversight decisions that could become relevant in any criminal proceeding that relied on the system.

EU AI Act risk classification for facial recognition systems.

Combining AFR with expert examination: the recommended workflow

The current professional consensus, reflected in guidance from the National Institute of Justice (US), the College of Policing (UK), and the European Network of Forensic Science Institutes (ENFSI), is that AFR outputs should be treated as investigative leads requiring human expert review before any evidential conclusion is drawn. This two-stage model is not merely a procedural formality. It is the methodological safeguard that separates a case built on expert opinion from one built on algorithm output.

Image quality assessment: Before any comparison, the examiner assesses the quality and limitations of each image: resolution, pose, lighting, occlusion, and the time gap between images. Poor-quality images constrain what conclusions are possible; this must be stated upfront, not buried in a caveat at the end of the report.
AFR candidate generation (investigative step): The probe image is run against the relevant gallery and a ranked candidate list is returned. The AFR output is a starting point, not a conclusion. The examiner records which system was used, the candidate's similarity score, and the candidate's rank in the list.
Independent morphological comparison (evidential step): The examiner conducts a feature-by-feature and holistic comparison of the probe and candidate images, following the morphological methodology described in the facial comparison framework. This comparison must be documented fully and must stand on its own merits without relying on the AFR score.
Conclusion on FISWG scale: The examiner formulates a conclusion using the standardised scale, explicitly noting that the conclusion is based on the morphological comparison and not on the automated score. Any limitations that constrain the conclusion are stated.
Disclosure: The report discloses the AFR system used, its published performance characteristics (including any relevant demographic accuracy data), and the role it played in generating the candidate. Courts and opposing counsel are entitled to this information.
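
The workflow above implies a set of items that every AFR-assisted report should record. One way to make that checkable is a simple record with a completeness check, sketched below. The class name, field names, and wording of the checks are hypothetical and are not taken from ENFSI, NIJ, College of Policing, or FISWG documents; "ExampleVendor" is a placeholder, not a real product.

```python
from dataclasses import dataclass


@dataclass
class AFRAssistedReport:
    """Hypothetical record of what an AFR-assisted report discloses."""
    system_name: str
    system_version: str
    candidate_rank: int
    similarity_score: float
    image_quality_limitations: str
    performance_data_disclosed: bool
    conclusion_basis: str  # expected value: "morphological comparison"


def missing_disclosures(report):
    """Return a list of workflow items the report fails to cover."""
    problems = []
    if not report.system_name or not report.system_version:
        problems.append("AFR system name and version")
    if not report.image_quality_limitations:
        problems.append("image quality limitations")
    if not report.performance_data_disclosed:
        problems.append("published performance and demographic accuracy data")
    if report.conclusion_basis != "morphological comparison":
        problems.append("conclusion must rest on the morphological comparison")
    return problems


# Invented example: the report leans on the score and omits performance data.
draft = AFRAssistedReport(
    system_name="ExampleVendor",
    system_version="",
    candidate_rank=1,
    similarity_score=0.93,
    image_quality_limitations="marginal resolution; partial occlusion",
    performance_data_disclosed=False,
    conclusion_basis="similarity score",
)

for problem in missing_disclosures(draft):
    print("Missing:", problem)
```

A check like this cannot judge the quality of the morphological comparison itself; it only flags a report that omits one of the disclosures the workflow calls for.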

Worked example

AFR candidate confirmed and limited by expert review

The algorithm is right. The examiner is honest about why the conclusion has a ceiling.

A series of retail thefts at connected stores is suspected to be the work of one person. Forty-six CCTV frames are obtained across the incidents. Most show a partial face, downward gaze, or heavy hood. Three frames show the face at a usable angle, at approximately 90 pixels across the face width. A police unit runs the best frame through a gallery of custody images using a commercial AFR system (a major-vendor system with published NIST FRVT testing). The top-ranked candidate receives a similarity score above the system's threshold. A forensic facial image examiner is then engaged.

Image assessment. The examiner characterises the three usable frames: 90 pixels across the face is marginal for morphological comparison; lighting is from above and to the right, creating shadow across the left side; pose is approximately 15 degrees yaw left and slight chin-down tilt in two frames, more frontal in the third. The gallery reference image is a high-resolution custody photo taken frontally under controlled lighting.
Morphological pass. Working through the available features, the examiner assesses the nasal dorsum profile (slightly concave, consistent in both images), the left orbital shape and medial canthal angle (consistent, but with a confidence limitation due to shadow), and the overall facial width-to-height ratio (consistent). The auricular region is not visible in any usable frame. The lower face is largely obscured by a raised collar in two frames. In the third frame the mandibular outline and chin shape are visible and consistent with the custody photo.
Holistic synthesis. The examiner's overall gestalt impression is of broad consistency between the CCTV face and the custody photograph. No feature shows a clear discordance. But the limited number of fully assessable features means the holistic impression carries less weight than it would in a higher-quality comparison.
Conclusion. The examiner concludes Supports identification on the FISWG scale. The report explicitly states: the conclusion cannot be raised to Identification because the combination of marginal resolution, partial occlusion, and lighting shadow means that only a limited subset of facial features could be assessed. The AFR candidate was generated by a system with a published false match rate (FMR) of X% for the relevant demographic and image-quality conditions; the expert conclusion is independent of that score.

The defence obtains the report and is now in a position to understand the basis of the opinion, the limitations acknowledged by the examiner, and the performance characteristics of the AFR system. That transparency is what allows the court to weigh the evidence appropriately and what distinguishes proper forensic practice from an algorithm being dressed up as expert testimony.

Key Takeaways

ArcFace, FaceNet, and similar deep CNN systems convert face images into embedding vectors and measure similarity by vector distance; a score above a threshold returns a candidate match, but the threshold choice always involves a trade-off between false positives and false negatives.
NIST FRVT benchmarks show that top algorithms achieve very low error rates on high-quality controlled images, but accuracy degrades significantly on surveillance-quality probe images and shows consistent demographic disparities that elevate error rates for some groups.
Investigative use of AFR (generating candidates for human review) is operationally standard; evidential use requires the expert's conclusion to rest on independent morphological examination, not on the algorithm's score.
R v. Atkins and Atkins [2009] confirmed admissibility of facial image comparison expert evidence in English courts; the case also reinforced that the examiner's trained opinion, not an automated output, is what constitutes the expert evidence.
The EU AI Act largely prohibits real-time AFR in public spaces for law enforcement and places post-event investigation AFR in the high-risk category with conformity, logging, and human-oversight requirements.
An expert using AFR must disclose the system and its performance characteristics, state any relevant demographic accuracy disparities, and ensure the court understands that the conclusion is based on human examination, not the algorithm.

What is the difference between investigative and evidential use of automated facial recognition?

Investigative use means AFR generates candidate matches that a human examiner then assesses; the AFR output itself is not presented as evidence. Evidential use means the AFR output, or a report based on it, is placed before a court to support identification. Courts in several jurisdictions have accepted AFR-assisted identification only when a trained human examiner has independently reviewed the candidate match, not the algorithm output alone.

What do NIST FRVT results show about accuracy disparities?

The NIST Face Recognition Vendor Testing programme has consistently found that error rates differ across demographic groups. False non-match rates are higher for women, older adults, and some ethnic groups compared with young adult males. False positive rates are also elevated for certain groups depending on the algorithm. An algorithm's overall quoted accuracy may hide significant disparities that matter for fairness in criminal justice applications.

What was the significance of R v. Atkins and Atkins in UK facial recognition law?

In the 2009 Court of Appeal case R v. Atkins and Atkins, the court accepted facial image comparison evidence presented by a suitably qualified expert. The case confirmed that facial image comparison is admissible under English law and that a trained examiner can give opinion evidence on the likelihood that two images depict the same person. It also reinforced that automated system outputs without expert review are not an adequate substitute for examined opinion evidence.

How does the EU AI Act classify automated facial recognition systems?

The EU AI Act places real-time remote biometric identification systems in the highest risk category, largely prohibited in public spaces for law enforcement unless specific narrow exceptions apply. Post-event AFR used in criminal investigations sits in the high-risk category requiring conformity assessments and transparency obligations.

What obligation does an expert have when reporting AFR-assisted identification?

An expert must disclose that AFR was used in generating the candidate, state the algorithm and its known performance characteristics including any relevant demographic accuracy disparities, confirm that the conclusion rests on independent human examination of the images rather than on the algorithm score alone, and acknowledge the limitations of the process.
