DNA Match Probability Calculation

DNA match probability calculation uses the product rule to combine per-locus random match probabilities into a multi-locus profile frequency, with the NRC-II theta correction applied to account for population substructure. This topic covers the mathematics and inferential logic, the likelihood ratio framework, and the conditions under which these calculations are challenged in court.

DNA match probability calculation is the process of estimating how often a given DNA profile would be expected to occur by chance in a reference population. The core method is the product rule: per-locus genotype frequencies drawn from a population database are multiplied together to yield a multi-locus profile frequency, called the random match probability (RMP). Because forensic databases reflect broad population groups rather than the small subpopulations a suspect may belong to, the NRC-II theta correction adjusts each genotype frequency, heterozygote or homozygote, upward to account for the elevated allele sharing that results from shared ancestry within subgroups. The resulting number is then placed inside a likelihood ratio (LR), which compares the probability of the observed DNA evidence under the prosecution hypothesis against its probability under the defence hypothesis. This three-step structure (product rule, substructure correction, likelihood ratio) is the basis of DNA evidence evaluation in virtually every jurisdiction that uses forensic DNA profiling.

The numbers these calculations produce can be extreme: RMPs of one in a trillion or smaller are not unusual for full STR profiles with 15 to 20 loci. Courts in the United States, the United Kingdom, Australia, and India have all grappled with how such numbers should be communicated, whether they can be combined with prior probabilities to reach a posterior probability of guilt, and what happens when the population database does not include the defendant's ethnic group. These are not purely mathematical questions; they sit at the boundary between statistical science and the rules of evidence.

The scientific framework for forensic DNA statistics was shaped by two landmark US National Research Council reports, NRC-I in 1992 and NRC-II in 1996. The NRC-II theta correction and the ceiling principle debate it replaced are still cited in court challenges today. UK practice, codified by the Forensic Science Regulator and courts including the Court of Appeal in R v Adams [1996] and R v T [2010], has taken a formally Bayesian approach that emphasises the LR and resists presenting a posterior probability to the jury. The cross-jurisdictional tensions make this topic essential for anyone who will present, challenge, or evaluate DNA evidence in court.

By the end of this topic you will be able to:

  • Apply the product rule to calculate a multi-locus random match probability from per-locus genotype frequencies.
  • Explain why population substructure causes the product rule to underestimate match probability and apply the NRC-II theta correction to a heterozygote frequency.
  • Construct a likelihood ratio for a DNA match under prosecution and defence hypotheses and interpret its magnitude.
  • Identify the prosecutor's fallacy and the defence attorney's fallacy and explain why both misrepresent what a DNA match probability means.
  • Describe how database choice, mixture interpretation, and partial profiles affect the reliability of DNA match probability calculations.
Key terms
Random match probability (RMP)
The probability that a randomly chosen unrelated person from the reference population would share the observed DNA profile by chance. Calculated by multiplying per-locus genotype frequencies using the product rule. A small RMP means the profile is rare; it does not by itself establish that the suspect is the source.
Product rule
The statistical principle that the probability of independent events co-occurring is the product of their individual probabilities. Applied to STR profiles: the multi-locus RMP is the product of genotype frequencies at each locus, justified by the independence of loci on different chromosomes.
Theta (FST) correction
An adjustment recommended by NRC-II for population substructure. For a heterozygote with alleles a and b from a subpopulation with allele frequencies p and q, the corrected frequency is 2[(1-theta)p + theta][(1-theta)q + theta] divided by (1+theta)(1+2theta). Theta values of 0.01 to 0.03 are commonly used for major population groups.
Likelihood ratio (LR)
The ratio of the probability of the evidence given the prosecution hypothesis to the probability of the evidence given the defence hypothesis. An LR greater than one supports the prosecution hypothesis. The LR is the preferred expression of DNA evidence weight because it separates the scientific finding from the court's prior assessment of guilt.
Prosecutor's fallacy
The error of treating the RMP as the probability that the suspect is innocent. An RMP of one in a million means one person per million would match by chance; it does not mean the probability of innocence is one in a million. The fallacy ignores prior probabilities and the size of the population of possible suspects.
Hardy-Weinberg equilibrium (HWE)
The state, expected in a large, randomly mating population with no selection, mutation, or migration, in which genotype frequencies follow directly from allele frequencies. Under HWE, the frequency of a heterozygote Aa is 2pq and a homozygote AA is p-squared, where p and q are allele frequencies. The per-locus genotype frequencies that feed the product rule assume HWE holds at each locus; the theta correction relaxes this assumption for substructured populations.

The product rule: combining locus probabilities

A forensic STR profile consists of genotypes at multiple loci, typically 15 to 24 in current practice (the FBI CODIS system expanded from 13 to 20 core loci in 2017; the UK National DNA Database uses 16 loci under the ESS standard). At each locus, the analyst observes either a heterozygote (two different alleles) or a homozygote (two copies of the same allele). The genotype frequency at each locus is estimated from a population reference database.

For a heterozygote with alleles a and b at frequencies p and q in the database, the Hardy-Weinberg genotype frequency is 2pq. For a homozygote with allele a at frequency p, the frequency is p-squared. The product rule then multiplies these frequencies across all loci to give the multi-locus RMP. If a profile has 20 loci and the per-locus genotype frequencies average around 0.05, the RMP is roughly 0.05 to the power of 20, which is 10 to the power of minus 26. In practice, per-locus frequencies vary considerably, and the final RMP for a full profile routinely falls in the range of one in several trillion to one in a quadrillion or beyond.
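
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a casework tool: the function names and the three-locus profile are hypothetical, and the allele frequencies are invented for the example rather than taken from any database.

```python
from math import prod


def genotype_frequency(p, q=None):
    """Hardy-Weinberg genotype frequency at one locus.

    p and q are allele frequencies; pass only p for a homozygote.
    """
    if q is None:
        return p * p      # homozygote: p squared
    return 2 * p * q      # heterozygote: 2pq


def random_match_probability(profile):
    """Product rule: multiply the per-locus genotype frequencies."""
    return prod(genotype_frequency(*alleles) for alleles in profile)


# Hypothetical profile: two heterozygous loci and one homozygous locus.
profile = [(0.10, 0.08), (0.20, 0.15), (0.10,)]
rmp = random_match_probability(profile)
print(f"RMP = {rmp:.3e}, about 1 in {1 / rmp:,.0f}")

# The 20-locus illustration from the text: 0.05 ** 20 is about 9.5e-27,
# which is the "10 to the power of minus 26" order of magnitude quoted above.
print(f"0.05 ** 20 = {0.05 ** 20:.2e}")
```

With only three loci the RMP is about 9.6 in a million; each additional locus multiplies in another small factor, which is why full profiles reach such extreme values.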

The product rule is valid only if the loci are statistically independent. For STR loci on different chromosomes, independence holds by Mendel's law of independent assortment. For loci on the same chromosome, independence holds if they are far enough apart that recombination between them is effectively random. The loci in the CODIS and ESS panels were selected partly on this criterion, and published tests of independence across major population databases support the assumption for these specific loci, though departures are occasionally observed and should be noted.

Population substructure and the NRC-II theta correction

National DNA databases are assembled from large, broadly defined population groups: European-American, African-American, Hispanic, South Asian, East Asian, and so on. Within each group, however, there are subpopulations, for example Gujarati Indians within the South Asian group, or Puerto Rican individuals within the Hispanic group, in which recent shared ancestry means that alleles are more correlated than the broad-group frequencies suggest. This is called population substructure, and it means the simple product rule underestimates how often two people from the same subgroup will share a profile.

The NRC-II report (1996) proposed the theta (FST) correction to handle this. Theta is a measure of genetic coancestry between members of the same subpopulation; typical values range from 0.01 for large, well-mixed populations to 0.03 for smaller, more isolated ones. For a heterozygote with alleles a and b at frequencies p and q, the theta-corrected frequency is:

2 x [(1-theta)p + theta] x [(1-theta)q + theta] / [(1+theta)(1+2theta)]

For a homozygote with allele a at frequency p, the corrected formula is:

[(2theta + (1-theta)p) x (3theta + (1-theta)p)] / [(1+theta)(1+2theta)]

Both expressions reduce to the Hardy-Weinberg values 2pq and p-squared when theta is zero. At the allele frequencies typical of forensic STR loci, these corrections make the genotype frequency larger (more conservative for the prosecution) than the uncorrected 2pq or p-squared estimate, reducing the risk that the product rule overstates the rarity of the profile.
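
As a rough check on the size of the correction, the two formulas can be evaluated directly. The sketch below assumes the widely cited form in which both the heterozygote and the homozygote expressions share the denominator (1+theta)(1+2theta); the function names are hypothetical and the allele frequencies (p = 0.10, q = 0.08) are illustrative only.

```python
def theta_heterozygote(p, q, theta):
    """Theta-corrected frequency for a heterozygote with allele frequencies p, q.

    Assumes the denominator (1 + theta)(1 + 2*theta).
    """
    numerator = 2 * (theta + (1 - theta) * p) * (theta + (1 - theta) * q)
    return numerator / ((1 + theta) * (1 + 2 * theta))


def theta_homozygote(p, theta):
    """Theta-corrected frequency for a homozygote with allele frequency p."""
    numerator = (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p)
    return numerator / ((1 + theta) * (1 + 2 * theta))


for theta in (0.0, 0.01, 0.03):
    het = theta_heterozygote(0.10, 0.08, theta)
    hom = theta_homozygote(0.10, theta)
    print(f"theta={theta:.2f}  heterozygote={het:.4f}  homozygote={hom:.4f}")
```

With theta set to zero the functions return the uncorrected 0.0160 and 0.0100. At theta = 0.01 they give roughly 0.0189 and 0.0149, and at theta = 0.03 roughly 0.0250 and 0.0269, so the homozygote is inflated proportionally more than the heterozygote.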

Different jurisdictions have settled on different default theta values. US practice under FBI guidelines uses theta = 0.01 for major population groups and 0.03 for Native American populations. UK Forensic Science Regulator guidance recommends 0.02 as the default. Indian forensic laboratories applying DNA profiling typically draw on FST estimates derived from studies of Indian subpopulations, where values between 0.01 and 0.04 have been reported depending on the group. Courts in all these jurisdictions have accepted the theta correction as scientifically sound, though some defence challenges have argued for higher theta values specific to the defendant's subgroup.

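| Parameter | Without theta correction | With theta = 0.01 | With theta = 0.03 |
| --- | --- | --- | --- |
| Heterozygote 2pq (p = 0.10, q = 0.08) | 0.0160 | 0.0189 | 0.0250 |
| Homozygote p² (p = 0.10) | 0.0100 | 0.0149 | 0.0269 |
| Effect on RMP | Baseline | RMP larger (more conservative) | RMP larger still |

Corrected values are computed from the heterozygote and homozygote formulas with denominator (1+theta)(1+2theta) and rounded to four decimal places.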

The likelihood ratio framework

The RMP is the probability of the evidence under the defence hypothesis that a random unrelated person is the source of the crime stain. On its own it answers only half the question. The full evaluative framework requires a second number: the probability of the evidence under the prosecution hypothesis that the suspect is the source. If the profile is complete and the suspect's profile was correctly typed, this probability is one: the evidence is certain if the suspect is the source. The likelihood ratio is then 1 divided by the RMP, so an RMP of one in a billion corresponds to an LR of one billion.

In more complex cases, the prosecution probability is not simply one. In a mixture, for example, the probability of observing the mixed profile if the suspect is a contributor depends on the mixture proportions and the number of contributors. In partial profiles where some loci could not be typed, the probability under the prosecution hypothesis may be less than one if the analyst cannot confirm all loci match. The LR framework handles these cases consistently: it always takes the ratio of the two conditional probabilities.

The LR is preferred over stating the RMP alone because it separates scientific inference from legal fact-finding. The scientist calculates the LR and presents it to the court. The court, applying Bayes' theorem implicitly, multiplies the LR by its prior odds of guilt to reach posterior odds. The scientist does not express an opinion on the posterior probability of guilt. This separation is formally endorsed by the Association of Forensic Science Providers (UK), SWGDAM (US), and the European Network of Forensic Science Institutes, and is increasingly adopted by forensic science regulators worldwide, including in Australia.
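
The odds form of Bayes' theorem described above is a single multiplication, and working through it shows why the RMP is not the probability of innocence. The sketch below is purely illustrative: the RMP, the pool of 500,000 possible sources, and the assumption that each is equally likely a priori are all hypothetical choices made for the example, and in practice the prior is a matter for the court, not the scientist.

```python
def posterior_odds(likelihood_ratio, prior_odds):
    """Odds form of Bayes' theorem: posterior odds = LR x prior odds."""
    return likelihood_ratio * prior_odds


def odds_to_probability(odds):
    """Convert odds in favour of a proposition to a probability."""
    return odds / (1 + odds)


rmp = 1e-6            # hypothetical random match probability
lr = 1 / rmp          # single-source full profile: LR = 1 / RMP

# Hypothetical prior: the suspect is one of 500,000 equally likely sources,
# so the prior odds that the suspect is the source are 1 to 499,999.
prior = 1 / 499_999

post = posterior_odds(lr, prior)
p_source = odds_to_probability(post)
print(f"posterior odds = {post:.2f}, probability suspect is source = {p_source:.3f}")
```

Under these assumed inputs the posterior probability that the suspect is the source is about 0.667, far from the 0.999999 that the prosecutor's fallacy would suggest. Changing the assumed prior changes the answer, which is exactly why the scientist reports the LR and leaves the prior to the fact-finder.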

Population database selection and its effect on the calculation

The RMP depends on the allele frequencies used, which come from a population database. Forensic laboratories maintain databases for major population groups in their jurisdiction, and the choice of which database to apply to a given case can change the calculated RMP by an order of magnitude or more. Best practice, articulated in the SWGDAM guidelines (2016 revision) and UK Forensic Science Regulator guidance, is to calculate the RMP using the database whose population most closely matches the defendant's ancestry, and to report results from multiple databases.

This creates a practical challenge when the defendant belongs to a group not represented in available databases. In the United Kingdom, this issue arose in cases involving defendants of South Asian, East African, or mixed-ethnicity backgrounds. The Court of Appeal in R v Doheny and Adams [1997] set out principles for presenting DNA evidence that include the requirement to use appropriate population databases and to acknowledge their limitations. In India, where over 4,000 ethnic and caste subpopulations exist, the absence of large STR frequency databases for specific communities has been a persistent limitation. The DNA Technology (Use and Application) Regulation Bill, introduced in Parliament in 2019, addressed laboratory accreditation and database governance but did not resolve the underlying population genetics gap.

When no exact-match database exists, analysts use the closest available database and apply a conservative (higher) theta correction to account for the additional uncertainty. Some laboratories calculate the RMP across three or four databases and present to the court the figure least favourable to the prosecution (the highest RMP, which is the most favourable to the defendant). This practice errs in the defendant's favour and is consistent with the general principle that forensic statistics should not overstate the strength of evidence.
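
A multi-database report of this kind can be sketched as follows. The database names and per-locus genotype frequencies are hypothetical placeholders, not real population data. The figure most favourable to the defendant is the largest RMP, because it makes the profile appear most common.

```python
from math import prod

# Hypothetical per-locus genotype frequencies for the same three-locus
# profile, estimated from three different (invented) reference databases.
genotype_freqs = {
    "database_A": [0.016, 0.060, 0.010],
    "database_B": [0.022, 0.045, 0.018],
    "database_C": [0.012, 0.071, 0.008],
}

# Product rule applied separately to each database.
rmps = {name: prod(freqs) for name, freqs in genotype_freqs.items()}

# The most conservative figure is the highest RMP (profile least rare).
most_conservative = max(rmps, key=rmps.get)

for name, rmp in rmps.items():
    print(f"{name}: RMP = {rmp:.3e} (1 in {1 / rmp:,.0f})")
print(f"Reported as most conservative: {most_conservative}")
```

In this invented example the three RMPs differ by a factor of about 2.6; with 20 loci the spread between databases can reach an order of magnitude or more.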

Mixtures, partial profiles, and low-template DNA

Single-source, full profiles are the simplest case. A substantial proportion of forensic DNA samples are mixtures from two or more contributors, partial profiles where some loci could not be typed because of degradation or inhibition, or low-template samples where stochastic effects introduce dropout (alleles that should be present are absent) and drop-in (sporadic extra alleles). Each complication changes the probability calculation.

For mixtures, the standard approach through the early 2010s was to use conservative inclusion or exclusion criteria: if a suspect's alleles were all present somewhere in the mixture, the suspect was included as a possible contributor. The statistic reported was the combined probability of inclusion (CPI): at each locus the frequencies of all alleles present in the mixture are summed and the sum is squared, and the per-locus values are multiplied across loci. CPI has been criticised for being insufficiently discriminating: it assigns the same statistic to every included individual, whether the inclusion is weak or strong. The current direction, now standard in the US (PCAST report, 2016) and UK, is probabilistic genotyping, in which software models the mixture statistically, considers all possible contributor combinations given the observed peak heights, and produces an LR for each candidate contributor.
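
For comparison with the LR approach, the usual definition of CPI (sum the frequencies of all alleles observed at a locus, square the sum, multiply across loci) can be sketched as below. The function name and the allele frequencies are hypothetical, and the sketch ignores dropout, peak heights, and any theta adjustment.

```python
from math import prod


def combined_probability_of_inclusion(mixture):
    """CPI for a mixture.

    mixture is a list of loci; each locus is a list of the population
    frequencies of every allele observed in the mixture at that locus.
    """
    return prod(sum(allele_freqs) ** 2 for allele_freqs in mixture)


# Hypothetical three-locus mixture (frequencies invented for illustration).
mixture = [
    [0.10, 0.08, 0.20],          # three alleles observed
    [0.15, 0.25, 0.05, 0.10],    # four alleles observed
    [0.30, 0.12],                # two alleles observed
]

cpi = combined_probability_of_inclusion(mixture)
print(f"CPI = {cpi:.4f}, about 1 in {1 / cpi:,.0f}")
```

The result, about 0.0077, depends only on which alleles are present in the mixture. It is the same for every person whose alleles fall within that set, which is the lack of discrimination that probabilistic genotyping is meant to address.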

For partial profiles, the LR is calculated only over the loci that could be typed. A partial profile produces a less discriminating LR because fewer loci are included. For low-template samples, probabilistic genotyping systems such as STRmix, TrueAllele, and ArmedXpert model dropout and drop-in probabilities explicitly, producing an LR that accounts for stochastic uncertainty. The validation requirements for these systems have been the subject of extensive court scrutiny in the US, UK, and Australia. The 2016 PCAST report recommended that probabilistic genotyping software be treated as a form of feature-comparison method requiring foundational validity studies before admission.
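
The loss of discriminating power in a partial profile can be illustrated with a simplified single-source calculation. This sketch assumes the probability of the evidence under the prosecution hypothesis is one at every typed locus and ignores dropout and drop-in entirely, which real probabilistic genotyping systems do not. The locus names are real STR loci, but the genotype frequencies attached to them are invented.

```python
from math import prod

# Hypothetical genotype frequencies at four loci (values are illustrative).
locus_freqs = {
    "D3S1358": 0.016,
    "vWA": 0.060,
    "FGA": 0.010,
    "D8S1179": 0.045,
}


def lr_single_source(freqs, typed_loci):
    """LR = 1 / RMP, with the RMP taken over the typed loci only."""
    return 1 / prod(freqs[locus] for locus in typed_loci)


lr_full = lr_single_source(locus_freqs, list(locus_freqs))
lr_partial = lr_single_source(locus_freqs, ["D3S1358", "FGA"])  # two loci typed

print(f"full profile LR    = {lr_full:,.0f}")
print(f"partial profile LR = {lr_partial:,.0f}")
```

Dropping two of the four loci reduces the LR from roughly 2.3 million to 6,250 in this example, because each untyped locus removes one multiplying factor from the calculation.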

Court presentation and evaluative reporting standards

The question of how to present a DNA LR or RMP to a jury has generated a large body of case law. In the United Kingdom, the Court of Appeal in R v T [2010] EWCA Crim 2439, a footwear-mark case, criticised the use of likelihood ratios where no reliable statistical database supports them, while recognising DNA as a field in which such a database exists. In R v Adams [1996] and [1998], the Court of Appeal rejected a defence attempt to have the jury perform a formal Bayesian calculation. These decisions shape how UK forensic scientists phrase their conclusions in DNA cases.

In the United States, the admissibility of DNA evidence is decided by individual federal and state courts under their own rules of evidence. Daubert v Merrell Dow Pharmaceuticals (1993) set the federal standard for expert scientific testimony, directing courts to consider whether the method has been tested, has been peer-reviewed, has a known error rate, and is generally accepted. Courts applying Daubert to DNA statistics have accepted both the product rule with theta correction and probabilistic genotyping, though challenges to specific software implementations are ongoing. Frye jurisdictions, which retain the older general-acceptance standard, have also admitted these methods consistently.

The role of statistics in evidence evaluation is inseparable from the legal context: see Role of Statistics in Evidence Evaluation for a broader treatment. The history of how courts came to accept or resist probabilistic evidence is traced in History of Statistical Evidence in Courts. Under the Bharatiya Sakshya Adhiniyam 2023 in India, Section 39 (formerly Section 45 of the Indian Evidence Act) governs the admissibility of expert opinions including forensic DNA statistics; the court may accept an expert's LR statement but is not bound by it.

Check your understanding

A suspect's STR profile matches the crime-scene profile at all 20 CODIS loci. The RMP is calculated as 1 in 2 trillion. What does this number mean?

Key Takeaways

  • The product rule multiplies per-locus genotype frequencies, estimated under Hardy-Weinberg equilibrium, to produce a multi-locus random match probability; it is valid only if the loci are statistically independent, an assumption that published tests support for the STR loci in the CODIS and ESS panels.
  • The NRC-II theta correction inflates heterozygote and homozygote frequencies to account for elevated allele sharing within subpopulations; using theta = 0.01 to 0.03 makes the RMP more conservative and is now standard practice in US, UK, and Australian forensic laboratories.
  • The likelihood ratio expresses DNA evidence strength as the ratio of the probability of the evidence under the prosecution hypothesis to its probability under the defence hypothesis; for a single-source full profile it equals the reciprocal of the RMP, but mixtures and partial profiles require more complex probabilistic models.
  • The prosecutor's fallacy equates the RMP with the probability of innocence, and the defence attorney's fallacy claims that the existence of other possible matches negates probative value; both errors misrepresent what a DNA match probability means.
  • Database selection, mixture interpretation method, and the presence of partial or low-template profiles are the three main sources of uncertainty in operational DNA statistics; probabilistic genotyping software now handles mixture and low-template cases in most accredited laboratories, but its validation requirements remain a live issue in courts internationally.
What is the product rule in DNA match probability calculation?
The product rule states that when genotype frequencies at multiple independent loci are known, the probability of a random person matching the full profile is the product of the individual genotype frequencies across all loci. It works because STR loci used in forensic profiling are located on different chromosomes or are far enough apart on the same chromosome that they segregate independently, satisfying the condition of statistical independence.
What is the NRC-II theta correction and why is it applied?
The theta (FST) correction was recommended in the 1996 NRC-II report to account for population substructure. Within any broad population database, subgroups share recent common ancestry and therefore have correlated allele frequencies. Without correction, the product rule underestimates the true probability that two people from the same subgroup share a profile. The corrected formulas replace the simple Hardy-Weinberg genotype frequencies, for heterozygotes and homozygotes alike, with versions inflated by theta, a measure of coancestry.
What is a likelihood ratio in the context of DNA evidence?
A likelihood ratio (LR) is the ratio of the probability of the observed DNA evidence under two competing hypotheses: the prosecution hypothesis that the suspect is the source, and the defence hypothesis that a random unrelated person is the source. An LR of one million means the evidence is one million times more probable if the suspect is the source than if an unrelated random person is. The LR is the scientifically preferred way to express DNA evidence strength because it separates the scientist's role from the court's role.
How does population database choice affect a DNA match probability?
Allele frequencies vary across populations, so the random match probability (RMP) depends on which database is used to estimate those frequencies. Best practice is to calculate the RMP using the population database that most closely matches the suspect's ancestry, and to report results from multiple databases to give the court the range. Using a database from a different population can produce an RMP that is either too low (overstating the evidence, to the prosecution's advantage) or too high (understating it, to the defence's advantage), both of which are problematic.
Can a DNA match probability be presented as the probability of innocence?
No. Equating the random match probability with the probability of innocence is the prosecutor's fallacy, a well-documented logical error. An RMP of one in a billion means that roughly one in a billion unrelated people would share the profile by chance; it does not mean the probability that the suspect is innocent is one in a billion. The correct probabilistic statement must account for prior probability, which is a matter for the court and not for the DNA analyst.
