The Role of Statistics in Evidence Evaluation
Statistics provides the tools forensic scientists and courts use to measure evidential strength, quantify uncertainty, and communicate conclusions honestly. This topic maps the main tasks where statistical reasoning enters forensic practice: comparison, classification, source attribution, and evaluative reporting.
Statistics enters forensic practice wherever a scientist must compare a questioned sample to a reference, decide whether two items share a common source, or communicate how strongly the evidence supports one account of events over another. The core tasks are comparison (does this profile resemble that one?), classification (what type of material is this?), source attribution (could these two samples have the same origin?), and evaluative reporting (how much does the evidence change the probability of each competing proposition?). Each task draws on probability theory, population data, and a framework for expressing conclusions that courts and investigators can actually use. Statistics does not answer the ultimate question of guilt or innocence; it quantifies the evidential weight of the scientific findings so that decision-makers can integrate them with everything else they know.
The need for statistical reasoning in forensic science is not new, but its prominence has grown with the expansion of comparison disciplines: DNA profiling, glass analysis, fibre examination, toolmark comparison, and digital trace evidence all generate data that cannot be evaluated without some account of how common the observed features are in the relevant population. A finding is only informative if the analyst can say how surprising it would be under each competing explanation. That requires a model, a database, and an inferential framework.
Courts in the UK, the Netherlands, Australia, New Zealand, and several other jurisdictions have moved toward requiring likelihood ratio (LR) based evaluative reports rather than bare categorical opinions. In the United States, the National Commission on Forensic Science and the President's Council of Advisors on Science and Technology both published reports identifying weak statistical foundations as a systemic problem in forensic disciplines. The Bharatiya Sakshya Adhiniyam 2023 in India, like its predecessor, leaves the weight of expert evidence to the court, making the clarity of the statistical presentation particularly important. Across all these systems, the underlying statistical questions are the same.
By the end of this topic you will be able to:
- Explain the four main tasks where statistics enters forensic evidence evaluation and give one example of each.
- Define the likelihood ratio and state what an LR greater than 1 and an LR less than 1 each mean for evidential support.
- Identify and correct the prosecutor's fallacy and the defence fallacy in a written forensic conclusion.
- Describe what evaluative reporting requires and how it differs from categorical match or exclusion opinions.
- State two reasons why population database quality limits the reliability of forensic probability estimates.
- Likelihood ratio (LR): The probability of the observed evidence under hypothesis H1 divided by its probability under hypothesis H2. In forensic reporting, H1 is usually the prosecution proposition and H2 the defence proposition. The LR expresses how much the evidence shifts the odds between the two hypotheses.
- Bayes' theorem: A rule relating prior probability, the likelihood ratio, and posterior probability. In forensic terms: posterior odds = prior odds multiplied by the LR. The theorem provides the logical framework for updating belief in a hypothesis in light of new evidence.
- Random match probability (RMP): The probability that a randomly chosen unrelated person from the reference population would share the observed feature profile. Used in DNA profiling and other comparison disciplines. The RMP is not the probability of innocence; confusing the two is the prosecutor's fallacy.
- Evaluative reporting: A reporting framework in which the forensic scientist states a pair of propositions, assigns a likelihood ratio or verbal equivalent representing the evidential support, and documents the population data and method. It avoids categorical match or exclusion language and leaves the prior probability to the fact-finder.
- Prosecutor's fallacy: The logical error of treating P(evidence | innocence) as equivalent to P(innocence | evidence). A very small match probability does not directly equal the probability that the defendant is not the source; it must be combined with prior odds via Bayes' theorem.
- Validation study: An empirical study that tests the accuracy, precision, and error rate of a forensic method under realistic conditions. Validation data are required to support the reliability of any probabilistic statement made using that method in court.
The four tasks where statistics enters forensic practice
Forensic statistics is not a single technique but a set of tools matched to distinct inferential tasks. Understanding which task is in play is the first step to choosing the right statistical approach and avoiding misinterpretation.
| Task | Forensic question | Statistical tool | Example discipline |
|---|---|---|---|
| Comparison | Do these two samples resemble each other more closely than would be expected by chance? | Significance test, LR | Glass refractive index, fibre colour |
| Classification | What type or category does this sample belong to? | Discriminant analysis, Bayesian classifier | Soil type, drug identification |
| Source attribution | Could these samples share a common origin? | LR with population database | DNA profiling, handwriting features |
| Evaluative reporting | How much does this evidence support H1 over H2? | LR, verbal scale, Bayesian network | All comparison disciplines |
Comparison asks whether the measured difference between two samples is smaller than would be expected from measurement error or natural variation alone. Classification asks which group a sample most likely belongs to, given prior knowledge of group characteristics. Source attribution asks whether two samples plausibly came from the same object, person, or location, taking into account how common the observed features are in the population. Evaluative reporting is the communication step: it translates the statistical output into a statement about evidential support that a court can interpret without needing to understand the underlying mathematics.
The distinction between comparison and source attribution matters because two samples can be analytically indistinguishable yet provide only weak support for a common source, if the features they share are very common in the population. A soil sample that matches the control sample in colour and texture is analytically indistinguishable from it, but the match is only weakly informative if those features are present in 40% of soils from the region: assuming a same-source sample would certainly match, the LR is about 1/0.4 = 2.5. The statistical step is what converts an analytical finding into an evidential weight.
Probability, Bayes' theorem, and the likelihood ratio
Every forensic statistical statement is ultimately a statement about probability under competing hypotheses. The likelihood ratio is the central quantity. If H1 is the prosecution proposition (for example, the defendant transferred this fibre to the victim) and H2 is the defence proposition (the fibre came from some other source), then the LR is P(evidence | H1) divided by P(evidence | H2). An LR of 1000 means the evidence is 1000 times more probable under H1 than under H2. An LR of 1 means the evidence is equally probable under both and therefore provides no support either way. An LR less than 1 supports H2.
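The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a casework tool: the function names and the two probabilities for the fibre example are hypothetical values chosen only to reproduce an LR of about 1000.

```python
def likelihood_ratio(p_e_given_h1: float, p_e_given_h2: float) -> float:
    """LR = P(evidence | H1) / P(evidence | H2)."""
    if not (0.0 <= p_e_given_h1 <= 1.0 and 0.0 < p_e_given_h2 <= 1.0):
        raise ValueError("probabilities must lie in [0, 1], with P(E | H2) > 0")
    return p_e_given_h1 / p_e_given_h2


def direction_of_support(lr: float) -> str:
    """State which proposition the evidence supports."""
    if lr > 1:
        return "supports H1 over H2"
    if lr < 1:
        return "supports H2 over H1"
    return "no support either way"


# Hypothetical fibre example: the finding is expected 80% of the time if the
# defendant transferred the fibre, and 0.08% of the time otherwise.
fibre_lr = likelihood_ratio(0.8, 0.0008)
print(round(fibre_lr), direction_of_support(fibre_lr))
```

Note that the function reports only the ratio and its direction; it says nothing about the probability that H1 is true.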
Bayes' theorem connects the LR to the overall probability update. In odds form: posterior odds = prior odds multiplied by the LR. The prior odds represent the probability of H1 relative to H2 before the forensic evidence is considered. Multiplying by the LR gives the posterior odds after the evidence. This is the logically correct way to incorporate forensic evidence, and it makes clear that the scientist's job is to provide the LR, while the fact-finder supplies the prior odds based on all other evidence in the case.
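A short sketch of the odds-form update shows why the same LR can lead to very different posterior probabilities. The prior odds used here are hypothetical; in a real case they belong to the fact-finder, not the scientist.

```python
def posterior_odds(prior_odds: float, lr: float) -> float:
    """Odds form of Bayes' theorem: posterior odds = prior odds * LR."""
    return prior_odds * lr


def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of H1 into a probability of H1."""
    return odds / (1.0 + odds)


lr = 1000.0  # supplied by the scientist

# Two hypothetical fact-finders with different views of the other evidence.
sceptical = posterior_odds(1 / 1000, lr)   # prior odds of 1 to 1000
receptive = posterior_odds(1 / 10, lr)     # prior odds of 1 to 10

print(odds_to_probability(sceptical))  # about 0.5
print(odds_to_probability(receptive))  # about 0.99
```

The scientist's contribution (the LR) is identical in both cases; only the prior odds differ.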
In practice the LR is often computed from a population database. For a single-source DNA profile, the numerator P(evidence | H1) is typically taken as 1, because the defendant's profile would be expected in the sample if the defendant left it. The denominator P(evidence | H2) is the random match probability: how often would a randomly chosen unrelated person have the same profile? Under these assumptions the LR is simply 1/RMP. Well-characterised allele frequency data for standard STR loci allow precise RMP estimates; these population samples are distinct from intelligence databases such as the FBI's CODIS system or the UK National DNA Database, which store profiles for searching rather than serving as representative population samples. For less well-studied trace evidence types the databases are smaller and the estimates less precise.
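As a rough sketch of how an RMP might be assembled from allele frequencies, the example below uses the standard product rule with Hardy-Weinberg genotype frequencies (p² for a homozygote, 2pq for a heterozygote). The three-locus profile and its allele frequencies are hypothetical, and the sketch assumes independent loci and ignores population substructure corrections and relatives, all of which matter in casework.

```python
from math import prod
from typing import Optional


def genotype_frequency(p: float, q: Optional[float] = None) -> float:
    """Hardy-Weinberg genotype frequency: p^2 if homozygous, 2pq if heterozygous."""
    return p * p if q is None else 2.0 * p * q


# Hypothetical profile: (allele frequency, second allele frequency or None).
profile = [
    (0.10, 0.20),   # heterozygote at locus 1
    (0.05, None),   # homozygote at locus 2
    (0.15, 0.25),   # heterozygote at locus 3
]

# Product rule: multiply genotype frequencies across loci assumed independent.
rmp = prod(genotype_frequency(p, q) for p, q in profile)

# With P(evidence | H1) taken as 1, the LR is the reciprocal of the RMP.
lr = 1.0 / rmp
print(f"RMP = {rmp:.2e}, LR = {lr:,.0f}")
```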
See Conditional Probability and Independence for the mathematical foundations, and Random Match Probability for how RMP is derived from population databases.
Common fallacies in presenting forensic statistics
Two fallacies appear repeatedly in trial transcripts, expert reports, and press coverage of forensic cases. Both arise from confusing conditional probabilities that run in opposite directions.
The prosecutor's fallacy treats P(evidence | innocent) as if it were P(innocent | evidence). An analyst who says the DNA match probability is 1 in 10 million and then implies this means the defendant has only a 1-in-10-million chance of being innocent has made this error. The probability of an innocent person matching is very small, but the probability that the defendant is the source given a match depends on how many other potential sources exist and on all the other evidence in the case. The UK Court of Appeal identified this fallacy in R v Doheny and Adams (1997) and it remains a recognised ground of appeal in many common-law systems.
The defence fallacy runs in the other direction: it treats the existence of other possible sources as if it made the evidence uninformative. A defence argument that says 'there are millions of people who could match, so the evidence means nothing' overstates the ambiguity. Even with a match probability of 1 in 10 million, a population of 100 million would contain only about 10 expected matching individuals (100 million × 1/10 million); the evidence has narrowed the field from 100 million people to roughly 10. The LR framework avoids both fallacies because it expresses the strength of the evidence as a ratio of probabilities rather than as a probability of the ultimate issue.
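The arithmetic behind both fallacies can be checked with a small sketch. It uses the figures from the text (a match probability of 1 in 10 million and a population of 100 million) and adds one explicit assumption that is not in the text: before the DNA evidence, every member of the population is treated as an equally likely source. That uniform prior is purely illustrative.

```python
def expected_matches(population_size: int, rmp: float) -> float:
    """Expected number of people in the population who would match by chance."""
    return population_size * rmp


def source_probability_uniform_prior(n_alternatives: int, rmp: float) -> float:
    """P(defendant is the source | match), assuming the defendant and each of
    n_alternatives other people were equally likely sources beforehand."""
    return 1.0 / (1.0 + n_alternatives * rmp)


rmp = 1e-7                 # 1 in 10 million
population = 100_000_000   # other potential sources

matches = expected_matches(population, rmp)
posterior = source_probability_uniform_prior(population, rmp)

# Prosecutor's fallacy: reading 1 - RMP as the probability of being the source.
fallacious = 1.0 - rmp

print(f"expected chance matches: {matches:.0f}")
print(f"P(source | match), uniform prior: {posterior:.3f}")
print(f"fallacious figure: {fallacious:.7f}")
```

Under this prior the posterior is about 0.09: far below the fallacious near-certainty, yet far above the prior of roughly 1 in 100 million, which is why neither fallacy survives the calculation.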
Evaluative reporting in forensic science
Evaluative reporting is the practice of expressing a forensic conclusion in terms of the relative support the evidence provides for two explicitly stated competing propositions. It replaces categorical language such as 'it is the opinion of this examiner that the samples have a common source' with a statement of the form: 'the evidence is much more probable if the samples share a common source than if they originate from different sources drawn from the relevant population'.
The European Network of Forensic Science Institutes (ENFSI) published a guideline on evaluative reporting in 2015 that sets out the framework most European laboratories use. The guideline requires: explicit statement of the propositions considered, the level of the hierarchy of propositions addressed (source, activity, offence), the likelihood ratio assigned, the population and database underpinning the LR, and any assumptions made. The UK Forensic Science Regulator's Codes of Practice and Conduct mandate the same approach for accredited laboratories in England and Wales. Australian and New Zealand guidelines follow similar principles.
The proposition hierarchy matters. A source-level proposition asks whether two samples share a common physical origin. An activity-level proposition asks whether the defendant performed the action alleged, given that the trace was found where it was. An offence-level proposition asks whether the defendant committed the offence. Scientists are generally equipped to address source-level propositions and sometimes activity-level propositions, but offence-level propositions require legal and factual knowledge beyond the scientist's expertise. Mixing levels is a common error in evaluative reporting that courts have criticised.
Verbal scales convert numeric LR values into descriptive phrases for court communication. ENFSI's recommended scale runs from limited support through moderate, moderately strong, strong, very strong, and extremely strong support, with each tier corresponding to an LR range. The scale is a communication tool, not a substitute for the numeric LR, and the underlying number should always be available for scrutiny.
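A verbal scale is essentially a lookup from LR ranges to phrases. The sketch below uses the tier names listed above, but the numeric boundaries are illustrative assumptions chosen for the example and are not quoted from any guideline; a laboratory would substitute the ranges from the scale it has actually adopted.

```python
import bisect

# ASSUMPTION: illustrative upper bounds for each tier, not taken from a guideline.
_UPPER_BOUNDS = [10, 100, 1_000, 10_000, 1_000_000]
_LABELS = [
    "limited", "moderate", "moderately strong",
    "strong", "very strong", "extremely strong",
]


def verbal_equivalent(lr: float) -> str:
    """Map a numeric LR to a verbal phrase, in whichever direction it points."""
    if lr <= 0:
        raise ValueError("LR must be positive")
    if lr == 1:
        return "no support for either proposition"
    favoured = "H1" if lr > 1 else "H2"
    magnitude = lr if lr > 1 else 1.0 / lr   # an LR below 1 supports H2
    label = _LABELS[bisect.bisect_right(_UPPER_BOUNDS, magnitude)]
    return f"{label} support for {favoured}"


for value in (5, 2500, 0.004):
    print(value, "->", verbal_equivalent(value))
```

Reporting the phrase alongside the number, rather than instead of it, keeps the underlying LR open to scrutiny.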
Error rates, validation, and what statistics cannot do
Every forensic method produces errors. A classification method that assigns a questioned document to one of two authors will sometimes assign it to the wrong author. A DNA profile interpretation protocol will occasionally call a present allele absent (drop-out) or an absent allele present (drop-in). The frequency of these errors in a population of realistic casework samples, measured under casework conditions, is the empirical error rate of the method.
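An observed error count from a validation study is itself an estimate, so it is usually reported with an interval. The sketch below uses the Wilson score interval, one common choice among several, and the counts (3 errors in 400 known-answer comparisons) are hypothetical.

```python
from math import sqrt


def wilson_interval(errors: int, trials: int, z: float = 1.96) -> tuple:
    """Approximate 95% Wilson score interval for an error proportion."""
    if trials <= 0 or not 0 <= errors <= trials:
        raise ValueError("need trials > 0 and 0 <= errors <= trials")
    p_hat = errors / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical validation study: 3 wrong conclusions in 400 known-answer tests.
low, high = wilson_interval(3, 400)
print(f"observed rate {3 / 400:.4f}, interval {low:.4f} to {high:.4f}")

# Zero observed errors does not demonstrate a zero error rate.
zero_low, zero_high = wilson_interval(0, 400)
print(f"no errors observed, upper bound {zero_high:.4f}")
```

The second call illustrates why a study with no observed errors still supports only an upper bound on the error rate, and why that bound shrinks only as the number of trials grows.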
Validation studies establish whether a method works as claimed. A valid method must be shown to produce low error rates on samples whose true answer is known, to perform consistently across operators and laboratories, and to degrade in predictable ways as sample quality decreases. The 2016 PCAST report in the United States reviewed the validation evidence for several forensic feature-comparison disciplines including firearms, bite marks, and footwear, and found that many lacked adequate foundational validity data. The report did not conclude the methods were worthless, but it did conclude that claims of high discriminating power were not supported by the published validation literature.
Statistics cannot substitute for a weak evidential foundation. A sophisticated Bayesian network built on unvalidated feature measurements will produce precise-looking numbers whose apparent precision conceals wide underlying uncertainty. The discipline this imposes is that every input to the statistical model (the feature measurements, the population database, the error rate) must itself be empirically grounded. When inputs are missing or uncertain, the output uncertainty must be propagated and communicated, not hidden behind a point estimate.
Population databases and their limits
The quality of a forensic probability estimate is bounded by the quality of the population database underlying it. Three parameters matter: size, composition, and sampling method. A DNA reference database of 10 000 unrelated individuals from a single ethnic group will produce RMP estimates that are unreliable for individuals from different ethnic backgrounds, because allele frequencies differ between populations. The standard approach in jurisdictions with mixed populations, including the United States, Australia, and India, is to compute the RMP for each major population group and report the most conservative (highest) figure, which is the one most favourable to the defendant, or to compute a weighted average.
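The two reporting options described above can be sketched as follows. The group names, the per-group RMP values, and the population weights are all hypothetical placeholders.

```python
# Hypothetical RMPs for the same profile computed from three reference groups.
rmp_by_group = {
    "group_A": 2.0e-9,
    "group_B": 7.5e-9,
    "group_C": 4.0e-10,
}

# Option 1: report the highest RMP, i.e. the smallest LR against the defendant.
reported_group = max(rmp_by_group, key=rmp_by_group.get)
reported_rmp = rmp_by_group[reported_group]

# Option 2: weight each group by its assumed share of the relevant population.
weights = {"group_A": 0.5, "group_B": 0.3, "group_C": 0.2}  # must sum to 1
weighted_rmp = sum(weights[g] * rmp_by_group[g] for g in rmp_by_group)

print(f"highest RMP: {reported_rmp:.1e} ({reported_group})")
print(f"weighted RMP: {weighted_rmp:.2e}")
```

Which option is appropriate depends on what is known about the relevant population of alternative sources; the sketch only shows that the two figures can differ.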
Sampling method affects whether the database is representative. A database assembled from convicted offenders is not representative of the general population in the same way a random household survey would be, because the offender population has demographic characteristics that differ from the general population. For forensic disciplines where the relevant population is all people who could plausibly be the source, a database that over-represents certain groups will produce biased probability estimates.
For some trace evidence types, no published population database exists, or the available databases are very small. Glass refractive index databases, fibre colour databases, and soil mineral composition databases are all smaller and less standardised than DNA reference databases. In these cases the scientist must be explicit about the database used, its size, how it was assembled, and how the absence of a larger database affects confidence in the probability estimate. Overstating precision when the database does not support it is a form of scientific misrepresentation.
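One way to make the effect of database size explicit is to propagate the uncertainty in a feature's frequency through to the LR. The sketch below assumes, for illustration only, a uniform Beta(1, 1) prior on the frequency, a numerator of 1 (so LR = 1/frequency), and two hypothetical databases in which the feature appears at the same 4% rate.

```python
import random


def lr_interval(matches: int, database_size: int,
                draws: int = 20_000, seed: int = 0) -> tuple:
    """Approximate 95% interval for LR = 1/f, where f is the feature frequency.

    The frequency is given a Beta(matches + 1, non-matches + 1) posterior,
    i.e. a uniform prior updated with the database counts (an assumption).
    """
    rng = random.Random(seed)
    a = matches + 1
    b = database_size - matches + 1
    lrs = sorted(1.0 / rng.betavariate(a, b) for _ in range(draws))
    return lrs[int(0.025 * draws)], lrs[int(0.975 * draws) - 1]


# Same observed frequency (4%), very different database sizes.
small_low, small_high = lr_interval(matches=2, database_size=50)
large_low, large_high = lr_interval(matches=200, database_size=5000)

print(f"n = 50:   LR between {small_low:.1f} and {small_high:.1f}")
print(f"n = 5000: LR between {large_low:.1f} and {large_high:.1f}")
```

Both databases give a point estimate of about 25, but the small database leaves the LR uncertain across roughly an order of magnitude, which is the information a point estimate alone would hide.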
See Population Databases for Forensic Statistics for a detailed treatment of database construction and its effect on probability calculations.
A forensic analyst states: 'The probability that this DNA profile belongs to someone other than the defendant is 1 in 50 million.' Which fallacy does this statement commit?
Key Takeaways
- Forensic statistics serves four tasks: comparison, classification, source attribution, and evaluative reporting. Each requires different tools, but all require a model of variation in the relevant population.
- The likelihood ratio is the standard measure of evidential weight: it divides the probability of the evidence under the prosecution hypothesis by its probability under the defence hypothesis, and it feeds directly into Bayes' theorem as the factor that updates prior odds to posterior odds.
- The prosecutor's fallacy confuses P(evidence | innocence) with P(innocence | evidence); the defence fallacy overstates ambiguity. Both are avoided by using the LR framework consistently and never expressing a statistical conclusion as a probability of the ultimate issue.
- Evaluative reporting requires explicit propositions, an assigned LR with verbal equivalent, and documentation of the population database and method. Courts in England and Wales, Australia, the Netherlands, and several other jurisdictions now require this format from accredited laboratories.
- The quality of a probability estimate is bounded by the size, composition, and sampling method of the underlying population database. Overstating precision when the database is small or unrepresentative is a recognised form of scientific misrepresentation that courts have penalised.
What is a likelihood ratio in forensic evidence evaluation?
What is the difference between a match probability and a likelihood ratio?
What is the prosecutor's fallacy?
What does evaluative reporting mean in forensic science?
Why do forensic scientists need population databases?