The Role of Statistics in Evidence Evaluation

Statistics provides the tools forensic scientists and courts use to measure evidential strength, quantify uncertainty, and communicate conclusions honestly. This topic maps the main tasks where statistical reasoning enters forensic practice: comparison, classification, source attribution, and evaluative reporting.

Last updated: 24 Jun 2026

Statistics enters forensic practice wherever a scientist must compare a questioned sample to a reference, decide whether two items share a common source, or communicate how strongly the evidence supports one account of events over another. The core tasks are comparison (does this profile resemble that one?), classification (what type of material is this?), source attribution (could these two samples have the same origin?), and evaluative reporting (how much does the evidence change the probability of each competing proposition?). Each task draws on probability theory, population data, and a framework for expressing conclusions that courts and investigators can actually use. Statistics does not answer the ultimate question of guilt or innocence; it quantifies the evidential weight of the scientific findings so that decision-makers can integrate them with everything else they know.

The need for statistical reasoning in forensic science is not new, but its prominence has grown with the expansion of comparison disciplines: DNA profiling, glass analysis, fibre examination, toolmark comparison, and digital trace evidence all generate data that cannot be evaluated without some account of how common the observed features are in the relevant population. A finding is only informative if the analyst can say how surprising it would be under each competing explanation. That requires a model, a database, and an inferential framework.

Courts in the UK, the Netherlands, Australia, New Zealand, and several other jurisdictions have moved toward requiring likelihood ratio (LR) based evaluative reports rather than bare categorical opinions. In the United States, the National Commission on Forensic Science and the President's Council of Advisors on Science and Technology both published reports identifying weak statistical foundations as a systemic problem in forensic disciplines. The Bharatiya Sakshya Adhiniyam 2023 in India, like its predecessor, leaves the weight of expert evidence to the court, making the clarity of the statistical presentation particularly important. Across all these systems, the underlying statistical questions are the same.

By the end of this topic you will be able to:

Explain the four main tasks where statistics enters forensic evidence evaluation and give one example of each.
Define the likelihood ratio and state what an LR greater than 1 and an LR less than 1 each mean for evidential support.
Identify and correct the prosecutor's fallacy and the defence fallacy in a written forensic conclusion.
Describe what evaluative reporting requires and how it differs from categorical match or exclusion opinions.
State two reasons why population database quality limits the reliability of forensic probability estimates.

Key terms

Likelihood ratio (LR): The probability of the observed evidence under hypothesis H1 divided by its probability under hypothesis H2. In forensic reporting, H1 is usually the prosecution proposition and H2 the defence proposition. The LR expresses how much the evidence shifts the odds between the two hypotheses.
Bayes' theorem: A rule relating prior probability, the likelihood ratio, and posterior probability. In forensic terms: posterior odds = prior odds multiplied by the LR. The theorem provides the logical framework for updating belief in a hypothesis in light of new evidence.
Random match probability (RMP): The probability that a randomly chosen unrelated person from the reference population would share the observed feature profile. Used in DNA profiling and other comparison disciplines. The RMP is not the probability of innocence; confusing the two is the prosecutor's fallacy.
Evaluative reporting: A reporting framework in which the forensic scientist states a pair of propositions, assigns a likelihood ratio or verbal equivalent representing the evidential support, and documents the population data and method. It avoids categorical match or exclusion language and leaves the prior probability to the fact-finder.
Prosecutor's fallacy: The logical error of treating P(evidence | innocence) as equivalent to P(innocence | evidence). A very small match probability is not the probability that the defendant is innocent or that someone else is the source; it must be combined with prior odds via Bayes' theorem.
Validation study: An empirical study that tests the accuracy, precision, and error rate of a forensic method under realistic conditions. Validation data are required to support the reliability of any probabilistic statement made using that method in court.

The four tasks where statistics enters forensic practice

Forensic statistics is not a single technique but a set of tools matched to distinct inferential tasks. Understanding which task is in play is the first step to choosing the right statistical approach and avoiding misinterpretation.

Task	Forensic question	Statistical tool	Example discipline
Comparison	Do these two samples resemble each other more than chance?	Significance test, LR	Glass refractive index, fibre colour
Classification	What type or category does this sample belong to?	Discriminant analysis, Bayesian classifier	Soil type, drug identification
Source attribution	Could these samples share a common origin?	LR with population database	DNA profiling, handwriting features
Evaluative reporting	How much does this evidence support H1 over H2?	LR, verbal scale, Bayesian network	All comparison disciplines

Comparison asks whether the measured difference between two samples is smaller than would be expected from measurement error or natural variation alone. Classification asks which group a sample most likely belongs to, given prior knowledge of group characteristics. Source attribution asks whether two samples plausibly came from the same object, person, or location, taking into account how common the observed features are in the population. Evaluative reporting is the communication step: it translates the statistical output into a statement about evidential support that a court can interpret without needing to understand the underlying mathematics.

The distinction between comparison and source attribution matters because two samples can be analytically indistinguishable and still provide only weak support for a common source, if the features they share are very common in the population. A soil sample that matches the control sample in colour and texture is analytically indistinguishable from it, but the match carries little evidential weight if those features are present in 40% of soils from the region. The statistical step is what converts an analytical finding into an evidential weight.

Probability, Bayes' theorem, and the likelihood ratio

Every forensic statistical statement is ultimately a statement about probability under competing hypotheses. The likelihood ratio is the central quantity. If H1 is the prosecution proposition (for example, the defendant transferred this fibre to the victim) and H2 is the defence proposition (the fibre came from some other source), then the LR is P(evidence | H1) divided by P(evidence | H2). An LR of 1000 means the evidence is 1000 times more probable under H1 than under H2. An LR of 1 means the evidence is equally probable under both and therefore provides no support either way. An LR less than 1 supports H2.

Bayes' theorem connects the LR to the overall probability update. In odds form: posterior odds = prior odds multiplied by the LR. The prior odds represent the probability of H1 relative to H2 before the forensic evidence is considered. Multiplying by the LR gives the posterior odds after the evidence. This is the logically correct way to incorporate forensic evidence, and it makes clear that the scientist's job is to provide the LR, while the fact-finder supplies the prior odds based on all other evidence in the case.
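
The odds form of Bayes' theorem described above can be sketched in a few lines of Python. This is a minimal illustration, not a casework tool: the function names and the prior odds are hypothetical, chosen only to show that the same LR leads to very different posterior probabilities depending on the prior the fact-finder supplies.

```python
def posterior_odds(prior_odds: float, lr: float) -> float:
    """Odds form of Bayes' theorem: posterior odds = prior odds * LR."""
    if prior_odds < 0 or lr < 0:
        raise ValueError("prior odds and LR must be non-negative")
    return prior_odds * lr


def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of H1 into the probability of H1."""
    return odds / (1.0 + odds)


# Hypothetical case: the same LR of 1000 applied to two different priors.
for prior in (1 / 10, 1 / 100_000):
    post = posterior_odds(prior, lr=1000)
    print(f"prior odds {prior:g} -> posterior odds {post:g}, "
          f"P(H1 | evidence) = {odds_to_probability(post):.4f}")
```

Keeping the LR and the prior odds as separate arguments mirrors the division of labour in the text: the scientist reports the LR and the fact-finder supplies the prior.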

In practice the LR is often computed from a population database. For a DNA profile, the numerator P(evidence | H1) is typically taken to be 1, because the defendant's profile would almost certainly be observed if the defendant left the sample. The denominator P(evidence | H2) is the random match probability: how often would a randomly chosen unrelated person produce the same profile? With a numerator of 1, the LR is simply 1 divided by the RMP. Allele-frequency databases for standard STR loci are large and well characterised, which allows precise RMP estimates; they are distinct from investigative databases such as the FBI's CODIS or the UK National DNA Database, which store profiles for searching rather than for estimating population frequencies. For less well-studied trace evidence types the databases are smaller and the estimates less precise.
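
For the simple DNA case just described, the LR can be computed directly from the RMP. The sketch below assumes a single-source profile with no drop-out, drop-in, or relatives among the alternative sources; the function name and the example RMP are hypothetical.

```python
import math


def lr_from_rmp(rmp: float, p_evidence_given_h1: float = 1.0) -> float:
    """LR = P(evidence | H1) / P(evidence | H2), with P(evidence | H2) = RMP."""
    if not 0.0 < rmp <= 1.0:
        raise ValueError("RMP must be in (0, 1]")
    if not 0.0 <= p_evidence_given_h1 <= 1.0:
        raise ValueError("P(evidence | H1) must be in [0, 1]")
    return p_evidence_given_h1 / rmp


# Hypothetical RMP of 1 in 10 million.
lr = lr_from_rmp(1e-7)
print(f"LR = {lr:.3g}, log10(LR) = {math.log10(lr):.1f}")
```

Leaving the numerator as a parameter makes it explicit that a value of 1 is a modelling choice; interpretation protocols that allow for drop-out would supply something smaller.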

See Conditional Probability and Independence for the mathematical foundations, and Random Match Probability for how RMP is derived from population databases.

Common fallacies in presenting forensic statistics

Two fallacies appear repeatedly in trial transcripts, expert reports, and press coverage of forensic cases. Both arise from confusing conditional probabilities that run in opposite directions.

The prosecutor's fallacy treats P(evidence | innocent) as if it were P(innocent | evidence). An analyst who says the DNA match probability is 1 in 10 million and then implies that the defendant therefore has only a 1-in-10-million chance of being innocent has made this error. The probability that an innocent person would match is very small, but the probability that the defendant is the source given a match depends on how many other potential sources exist and on all the other evidence in the case. The UK Court of Appeal identified this fallacy in R v Doheny and Adams (1997) and it remains a recognised ground of appeal in many common-law systems.

The defence fallacy runs in the other direction: it treats the existence of innocent explanations as if it made the evidence uninformative. A defence argument of the form 'millions of people could match, so the evidence means nothing' overstates the ambiguity. Even with a match probability of 1 in 10 million, a population of 100 million would yield about 10 expected matching individuals; the evidence still narrows the field from 100 million people to roughly 10. The LR framework avoids both fallacies because it expresses the evidence as a ratio of probabilities under the two propositions rather than as a probability of the ultimate issue.
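
The numbers in the two fallacies can be checked with a short sketch. It assumes, purely for illustration, that every member of the population is an equally likely source before the DNA evidence is considered and that there is no other evidence; real cases rarely satisfy either assumption. The function names are hypothetical.

```python
def expected_matches(rmp: float, population: int) -> float:
    """Expected number of people in the population who would match by chance."""
    return rmp * population


def posterior_source_probability(rmp: float, other_candidates: int) -> float:
    """P(defendant is the source | match) under equal prior weight per person."""
    prior_odds = 1.0 / other_candidates  # defendant versus everyone else
    lr = 1.0 / rmp                       # numerator P(evidence | H1) taken as 1
    post_odds = prior_odds * lr
    return post_odds / (1.0 + post_odds)


rmp = 1e-7                # match probability of 1 in 10 million
population = 100_000_000  # 100 million potential sources
print(expected_matches(rmp, population))
print(posterior_source_probability(rmp, population))
```

Under these assumptions the posterior probability is about 1 in 11: far from the near-certainty the prosecutor's fallacy implies, yet far above the 1-in-100-million prior, so the evidence is not meaningless either.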

Evaluative reporting in forensic science

Evaluative reporting is the practice of expressing a forensic conclusion in terms of the relative support the evidence provides for two explicitly stated competing propositions. It replaces categorical language such as 'it is the opinion of this examiner that the samples have a common source' with a statement of the form: 'the evidence is far more probable if the samples share a common source than if they originate from different sources drawn from the relevant population.'

The European Network of Forensic Science Institutes (ENFSI) published a guideline on evaluative reporting in 2015 that sets out the framework most European laboratories use. The guideline requires: explicit statement of the propositions considered, the level of the hierarchy of propositions addressed (source, activity, offence), the likelihood ratio assigned, the population and database underpinning the LR, and any assumptions made. The UK Forensic Science Regulator's Codes of Practice and Conduct mandate the same approach for accredited laboratories in England and Wales. Australian and New Zealand guidelines follow similar principles.

The proposition hierarchy matters. A source-level proposition asks whether two samples share a common physical origin. An activity-level proposition asks whether the defendant performed the action alleged, given that the trace was found where it was. An offence-level proposition asks whether the defendant committed the offence. Scientists are generally equipped to address source-level propositions and sometimes activity-level propositions, but offence-level propositions require legal and factual knowledge beyond the scientist's expertise. Mixing levels is a common error in evaluative reporting that courts have criticised.

Verbal scales convert numeric LR values into descriptive phrases for court communication. ENFSI's recommended scale runs from limited support through moderate, moderately strong, strong, very strong, and extremely strong support, with each tier corresponding to an LR range. The scale is a communication tool, not a substitute for the numeric LR, and the underlying number should always be available for scrutiny.
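
A verbal scale is essentially a lookup from LR ranges to phrases. The sketch below uses the tier names listed above, but the numeric boundaries are an assumption made for illustration; the scale adopted by a particular laboratory or guideline should be checked before any such mapping is used.

```python
# Assumed lower bounds for each tier; illustrative only, not an official scale.
VERBAL_BANDS = [
    (1_000_000, "extremely strong support"),
    (10_000, "very strong support"),
    (1_000, "strong support"),
    (100, "moderately strong support"),
    (10, "moderate support"),
]


def verbal_equivalent(lr: float) -> str:
    """Map a numeric LR to a verbal phrase, keeping the number alongside it."""
    if lr <= 0:
        raise ValueError("LR must be positive")
    if lr == 1:
        return "no support for either proposition (LR = 1)"
    favoured = "H1" if lr > 1 else "H2"
    # An LR below 1 supports H2 with strength 1 / LR.
    magnitude = lr if lr > 1 else 1.0 / lr
    for lower_bound, label in VERBAL_BANDS:
        if magnitude >= lower_bound:
            return f"{label} for {favoured} (LR = {lr:g})"
    return f"limited support for {favoured} (LR = {lr:g})"


print(verbal_equivalent(6.8))
print(verbal_equivalent(25_000))
print(verbal_equivalent(0.0005))
```

Returning the numeric LR together with the phrase reflects the point made above: the verbal tier is a communication aid, and the number stays available for scrutiny.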

Error rates, validation, and what statistics cannot do

Every forensic method produces errors. A classification method that assigns a questioned document to one of two authors will sometimes assign it to the wrong author. A DNA profile interpretation protocol will occasionally call a present allele absent (drop-out) or an absent allele present (drop-in). The frequency of these errors in a population of realistic casework samples, measured under casework conditions, is the empirical error rate of the method.

Validation studies establish whether a method works as claimed. A valid method must be shown to produce low error rates on samples whose true answer is known, to perform consistently across operators and laboratories, and to degrade in predictable ways as sample quality decreases. The 2016 PCAST report in the United States reviewed the validation evidence for several forensic feature-comparison disciplines including firearms, bite marks, and footwear, and found that many lacked adequate foundational validity data. The report did not conclude the methods were worthless, but it did conclude that claims of high discriminating power were not supported by the published validation literature.
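
An observed error rate from a validation study is itself an estimate, so it is usually more informative to report an upper confidence bound as well. The sketch below computes a one-sided exact binomial (Clopper-Pearson) upper bound by bisection. The study counts are hypothetical, and the approach assumes independent trials with a constant error probability.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_error_bound(errors: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided exact upper bound on the true error rate."""
    if trials <= 0 or not 0 <= errors <= trials:
        raise ValueError("need trials > 0 and 0 <= errors <= trials")
    if errors == trials:
        return 1.0
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        # The CDF falls as p rises; find the p at which P(X <= errors) = alpha.
        if binomial_cdf(errors, trials, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical validation study: 3 errors in 200 known-source comparisons.
errors, trials = 3, 200
print(f"observed rate = {errors / trials:.4f}")
print(f"95% upper bound = {upper_error_bound(errors, trials):.4f}")
```

The gap between the observed rate and the upper bound shrinks as the number of trials grows, which is one reason small validation studies support only weak claims about accuracy.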

Statistics cannot substitute for a weak evidential foundation. A sophisticated Bayesian network built on unvalidated feature measurements will produce precise-looking numbers whose apparent precision conceals wide underlying uncertainty. The discipline requirement is that every input to the statistical model (the feature measurements, the population database, and the error rate) must itself be empirically grounded. When inputs are missing or uncertain, the output uncertainty must be propagated and communicated, not hidden behind a point estimate.

Population databases and their limits

The quality of a forensic probability estimate is bounded by the quality of the population database underlying it. Three parameters matter: size, composition, and sampling method. A DNA reference database of 10 000 unrelated individuals from a single ethnic group will produce RMP estimates that are unreliable for individuals from different ethnic backgrounds, because allele frequencies differ between populations. The standard approach in jurisdictions with mixed populations, including the United States, Australia, and India, is to compute the RMP for each major population group and report the most conservative (highest) figure, which is the one most favourable to the defendant, or to compute a weighted average.
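
The per-group calculation can be sketched as follows. The example uses the basic product rule under Hardy-Weinberg proportions (p squared for a homozygote, 2pq for a heterozygote) with independence across loci, and it omits the subpopulation corrections used in casework. The loci, alleles, and frequencies are invented for illustration and come from no real database.

```python
from math import prod

# Hypothetical allele frequencies at two loci in two population groups.
ALLELE_FREQS = {
    "group_A": {"locus1": {"12": 0.10, "14": 0.20}, "locus2": {"9": 0.30}},
    "group_B": {"locus1": {"12": 0.25, "14": 0.10}, "locus2": {"9": 0.20}},
}

# Observed profile: heterozygous 12,14 at locus1 and homozygous 9,9 at locus2.
PROFILE = {"locus1": ("12", "14"), "locus2": ("9", "9")}


def genotype_frequency(freqs: dict, alleles: tuple) -> float:
    """Expected genotype frequency under Hardy-Weinberg proportions."""
    a, b = alleles
    if a == b:
        return freqs[a] ** 2
    return 2 * freqs[a] * freqs[b]


def rmp(group_freqs: dict, profile: dict) -> float:
    """Product rule: multiply genotype frequencies across loci."""
    return prod(
        genotype_frequency(group_freqs[locus], alleles)
        for locus, alleles in profile.items()
    )


rmp_by_group = {group: rmp(freqs, PROFILE) for group, freqs in ALLELE_FREQS.items()}
reported_group = max(rmp_by_group, key=rmp_by_group.get)  # highest RMP

for group, value in rmp_by_group.items():
    print(f"{group}: RMP = {value:.4g}")
print(f"reported figure: {reported_group}")
```

Reporting the highest RMP gives the smallest LR, so any error introduced by choosing the wrong reference group works in the defendant's favour.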

Sampling method affects whether the database is representative. A database assembled from convicted offenders is not representative of the general population in the same way a random household survey would be, because the offender population has demographic characteristics that differ from the general population. For forensic disciplines where the relevant population is all people who could plausibly be the source, a database that over-represents certain groups will produce biased probability estimates.

For some trace evidence types, no published population database exists, or the available databases are very small. Glass refractive index databases, fibre colour databases, and soil mineral composition databases are all smaller and less standardised than DNA reference databases. In these cases the scientist must be explicit about the database used, its size, how it was assembled, and how the absence of a larger database affects confidence in the probability estimate. Overstating precision when the database does not support it is a form of scientific misrepresentation.
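
One way to make the effect of database size explicit is to put an interval around the estimated frequency and carry it through to the LR. The sketch below uses a Wilson score interval for the proportion, which is one reasonable choice among several. The counts and the numerator value are hypothetical, and the calculation ignores every other source of uncertainty.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(centre - half, 0.0), min(centre + half, 1.0)


def lr_range(p_evidence_given_h1: float, matches: int, n: int) -> tuple:
    """Range of LR values implied by the interval on the database frequency."""
    low_f, high_f = wilson_interval(matches, n)
    lower_lr = p_evidence_given_h1 / high_f
    upper_lr = float("inf") if low_f <= 0 else p_evidence_given_h1 / low_f
    return lower_lr, upper_lr


# Same observed frequency (12%) from a small and a large hypothetical database.
for matches, n in ((6, 50), (600, 5000)):
    low, high = lr_range(0.82, matches, n)
    print(f"n = {n}: LR between {low:.1f} and {high:.1f}")
```

The point estimate of the LR is identical in both cases; only the width of the range changes, and that width is what a report built on a small database needs to communicate.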

See Population Databases for Forensic Statistics for a detailed treatment of database construction and its effect on probability calculations.

Worked example

Evaluating a glass fragment match using the likelihood ratio framework

A step-by-step example showing how the LR framework is applied to a physical trace evidence comparison, from the analytical measurement to the courtroom-ready evaluative statement.

A broken window pane is found at a burglary scene. Glass fragments recovered from the suspect's clothing are submitted for comparison. The forensic scientist measures refractive index (RI) for each fragment using the oil immersion method with temperature control. The suspect's fragments have a mean RI of 1.51842 with a standard deviation of 0.00003. The control glass from the scene has a mean RI of 1.51840 with a standard deviation of 0.00002. The question is how to express the strength of this match evidentially.

State the propositions. H1 (prosecution): the glass on the suspect's clothing came from the broken window. H2 (defence): the glass on the suspect's clothing came from some other source in the relevant population of glass objects.
Compute P(evidence | H1). If H1 is true, the suspect's fragments and the control glass come from the same pane. The observed difference in mean RI (0.00002) is within the measurement uncertainty of the method, so data this close are highly probable given a common source. A within-source variation model (kernel density or parametric) quantifies this; here the value is taken as approximately 0.82.
Compute P(evidence | H2). If H2 is true, the suspect's fragments come from a different pane drawn from the background population. The glass RI database (for example, the UK FSS glass database or the GFDB maintained by research groups) is queried for the proportion of glass samples with RI in the range 1.51840 plus or minus 0.00010. This proportion is approximately 0.12 in modern float glass databases.
Calculate the LR. LR = 0.82 / 0.12 = approximately 6.8. The evidence is about 7 times more probable if the glass came from the crime scene window than if it came from a random background source.
Map to a verbal scale. An LR of approximately 7 falls in the limited support range: on commonly used verbal scales an LR between 1 and 10 is described as limited or weak support, and an LR between 10 and 100 as moderate support. The scientist documents the propositions, the database used, its size, and the numerical LR in the report.
Write the evaluative statement. The correct form: The evidence is approximately 7 times more probable if the glass fragments originated from the broken window than if they originated from a random glass object drawn from the relevant background population. This statement does not assert that the fragments came from the window; it expresses the evidential weight for the court to combine with other evidence.
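
The arithmetic in the steps above can be reproduced with a short script. The two probabilities are the values quoted in the worked example; the function name and the wording template are hypothetical, and a real report would also document the database and the method.

```python
def evaluative_statement(p_e_h1: float, p_e_h2: float, h1: str, h2: str) -> tuple:
    """Return the LR and a sentence expressing the evidential weight."""
    if p_e_h1 < 0 or p_e_h2 <= 0:
        raise ValueError("need P(E | H1) >= 0 and P(E | H2) > 0")
    lr = p_e_h1 / p_e_h2
    sentence = (
        f"The evidence is approximately {lr:.0f} times more probable "
        f"if {h1} than if {h2}."
    )
    return lr, sentence


lr, statement = evaluative_statement(
    p_e_h1=0.82,  # probability of the evidence given a common source
    p_e_h2=0.12,  # proportion of background glass in the matching RI range
    h1="the glass fragments originated from the broken window",
    h2="they originated from a random glass object in the background population",
)
print(f"LR = {lr:.2f}")
print(statement)
```

The template is written for an LR above 1; an LR below 1 would need the propositions swapped so that the sentence still reads as support for one account over the other.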

Check your understanding

A forensic analyst states: 'The probability that this DNA profile belongs to someone other than the defendant is 1 in 50 million.' Which fallacy does this statement commit?

Key Takeaways

Forensic statistics serves four tasks: comparison, classification, source attribution, and evaluative reporting. Each requires different tools, but all require a model of variation in the relevant population.
The likelihood ratio is the standard measure of evidential weight: it divides the probability of the evidence under the prosecution hypothesis by its probability under the defence hypothesis, and it feeds directly into Bayes' theorem as the factor that updates prior odds to posterior odds.
The prosecutor's fallacy confuses P(evidence | innocence) with P(innocence | evidence); the defence fallacy overstates ambiguity. Both are avoided by using the LR framework consistently and never expressing a statistical conclusion as a probability of the ultimate issue.
Evaluative reporting requires explicit propositions, an assigned LR with verbal equivalent, and documentation of the population database and method. Courts in England and Wales, Australia, the Netherlands, and several other jurisdictions have moved toward requiring this format from accredited laboratories.
The quality of a probability estimate is bounded by the size, composition, and sampling method of the underlying population database. Overstating precision when the database is small or unrepresentative is a recognised form of scientific misrepresentation that courts have penalised.

What is a likelihood ratio in forensic evidence evaluation?

A likelihood ratio (LR) is the probability of the observed evidence given one hypothesis, divided by the probability of the same evidence given an alternative hypothesis. In forensic practice the two hypotheses are usually prosecution and defence propositions. An LR greater than 1 supports the prosecution proposition; an LR less than 1 supports the defence proposition. The LR tells the court how much the evidence changes the odds, without prescribing what the prior odds should be.

What is the difference between a match probability and a likelihood ratio?

A random match probability (RMP) is the probability that a randomly selected unrelated person from a reference population would share the observed profile. A likelihood ratio compares the probability of the evidence under two stated propositions: the prosecution proposition and the defence proposition. The LR framework is more complete because it explicitly models both propositions, whereas an RMP addresses only one side of the comparison; in the simplest DNA case, where the numerator is taken to be 1, the LR is 1 divided by the RMP. Courts in the UK and many European jurisdictions now prefer LR-based statements over bare RMP figures.

What is the prosecutor's fallacy?

The prosecutor's fallacy is the error of treating the probability of the evidence given innocence as if it were the probability of innocence given the evidence. For example, stating that a DNA match probability of 1 in 1 million means the defendant has a 1-in-a-million chance of being innocent conflates two distinct conditional probabilities. The correct statement is that the evidence is 1 million times more probable if the defendant is the source than if a randomly chosen unrelated person is the source.

What does evaluative reporting mean in forensic science?

Evaluative reporting means expressing a forensic conclusion as a statement about the relative support the evidence provides for two competing propositions, rather than as a categorical match or exclusion. An evaluative report states the proposition pair explicitly, assigns a likelihood ratio or verbal equivalent, and explains the population database and method used. This approach is endorsed by the European Network of Forensic Science Institutes (ENFSI) and is standard practice in the UK Forensic Science Regulator's framework.

Why do forensic scientists need population databases?

Population databases supply the denominator for probability calculations. To compute how common a particular DNA profile, fibre colour, or glass refractive index is in a relevant population, the scientist must have measurements from a representative sample of that population. The size, composition, and sampling method of the database directly affect the reliability of the resulting probability estimate. A database that is too small or drawn from an unrepresentative group produces unreliable statistics.
