Population Databases for Forensic Statistics

Reference population databases supply the frequency data that underpin match probability calculations in forensic science, from DNA random match probabilities to glass refractive index comparisons. This topic explains how those databases are built, stratified, and validated, and what happens when the reference population does not match the suspect population.

Last updated: 24 Jun 2026

A reference population database is a structured collection of measurements or allele frequencies drawn from a defined human population, used to calculate the probability that a randomly chosen person from that population would match the forensic evidence in question. When a forensic scientist says that a DNA profile has a random match probability of one in ten billion, that number comes from multiplying allele frequencies stored in a reference database. The validity of the statistic, and ultimately the weight a court gives to the evidence, depends on whether the database accurately represents the population from which the suspect is drawn. The same logic applies to glass refractive index databases, paint colour databases, fibre spectra databases, and any other reference collection used to estimate the rarity of a physical characteristic.

Building a valid reference database requires decisions about who to include, how many samples to collect, how to handle population substructure, and how to validate the resulting frequency estimates. These decisions are not merely technical: they affect whether a court will accept the statistical evidence, whether a defendant's population is fairly represented, and whether reported match probabilities are conservative or inflated. The FBI's CODIS database, the UK National DNA Database, India's DNA Profiling Bill framework, and equivalent systems in Australia, Canada, and the European Union all navigate these questions, and their answers differ in meaningful ways.

Forensic reference databases are periodically updated as populations change, as analytical methods improve, and as new loci are added to profiling systems. The 13-locus CODIS system used in the United States expanded to 20 loci in 2017, requiring new frequency estimates from the existing population samples. Every expansion requires re-validation and creates a transition period during which laboratories must manage profiles typed under the old and new systems simultaneously. Understanding how databases are constructed and why they must be kept current is part of understanding any forensic statistic derived from them.

By the end of this topic you will be able to:

Describe the criteria for a well-constructed forensic reference database, including sample size, sampling strategy, and population definition.
Explain what population substructure is and how the theta correction adjusts for it in DNA match probability calculations.
Identify the main forensic evidence types that rely on reference databases and compare the validation standards applied to each.
Explain the consequences of using a mismatched reference population and how courts in different jurisdictions have responded to this issue.
Describe how a reference database is validated, including proficiency testing, reproducibility checks, and comparison against other databases.

Key terms

Reference population database: A curated collection of allele frequencies or feature measurements from a defined population, used to estimate the probability that a randomly chosen member of that population would show the same forensic feature as the questioned sample.
Population substructure: The condition in which allele frequencies differ systematically between subgroups within a nominally single population. Substructure means that individuals within a subgroup share alleles more often than random mating would predict, which can cause a database to under-estimate or over-estimate match probabilities for members of a specific subgroup.
Theta correction: A statistical adjustment applied to DNA match probability calculations to account for the allele-sharing expected within subpopulations as a result of substructure. Theta is also called the coancestry coefficient or FST. Values of 0.01 to 0.03 are commonly applied in forensic practice; higher values are more conservative.
Allele frequency: The proportion of a specific allele variant at a given genetic locus in a defined population. Forensic match probability calculations convert allele frequencies into a genotype frequency at each locus and then multiply the per-locus genotype frequencies, using the product rule, to compute the overall profile frequency.
Minimum allele frequency (minimum frequency floor): A conservative lower bound applied when an allele is absent or very rare in the database, so that gaps in sampling do not produce an artificially small match probability that overstates the rarity of the profile. Common values are 5/(2N) (where N is the number of individuals in the database) or a fixed floor such as 0.001 (one in a thousand).
Hardy-Weinberg equilibrium (HWE): The expected genotype frequency distribution in a population that is randomly mating, large, and free from selection, mutation, and migration at a given locus. Forensic databases are tested for HWE at each locus: significant departure may indicate substructure, typing errors, or sample quality problems.

What makes a valid forensic reference database

Three requirements define a valid reference database: the samples must come from the right population, the database must be large enough to estimate frequencies with acceptable uncertainty, and the collection and typing process must be reproducible. The right population is the one that a trier of fact would consider a plausible source of the crime scene sample, given what is known about the suspect and the circumstances of the offence. In practice, forensic databases are built around ethnic or geographic categories because those are the groupings for which biological frequency data can be systematically collected. The resulting mismatch between biological reality (which is continuous) and database categories (which are discrete) is one of the persistent tensions in forensic population genetics.

Sample size requirements depend on the rarity of the feature being measured and the number of subpopulations to be covered. The National Research Council's 1996 report on DNA evidence (NRC II) recommended databases of at least 100 unrelated individuals per population group for each locus. At this size (200 sampled alleles per locus), a 95% confidence interval around an estimated frequency of 0.01 spans roughly 0.003 to 0.036 when computed as a Wilson score interval, which is adequate for most forensic reporting. The SWGDAM 2016 guidelines for the US and the European Network of Forensic Science Institutes (ENFSI) guidance for EU member states both set minimum sample sizes in the same range, while noting that larger databases reduce uncertainty and are preferred.
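The link between database size and uncertainty can be sketched in a few lines of Python. This is a minimal illustration, assuming a Wilson score interval; the function name wilson_interval is hypothetical, and other interval methods give somewhat different bounds.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# 100 individuals contribute 200 alleles at a locus;
# an allele observed twice has an estimated frequency of 0.01.
low, high = wilson_interval(successes=2, trials=200)
print(f"95% interval: {low:.4f} to {high:.4f}")

# The same estimated frequency in a database ten times larger.
low_big, high_big = wilson_interval(successes=20, trials=2000)
print(f"95% interval (n = 1000 individuals): {low_big:.4f} to {high_big:.4f}")
```

Comparing the two printed intervals shows why larger databases are preferred: the estimate stays at 0.01, but the interval around it narrows.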

Reproducibility requires that the typing method, quality control criteria, and data-entry procedures are documented and consistently applied. A database built partly from one kit chemistry and partly from another is of limited value unless the cross-kit concordance has been validated. This issue arose concretely when STR profiling replaced RFLP profiling in the 1990s and when expanded CODIS loci were introduced in 2017: existing databases had to be retyped or re-validated before the new locus frequencies could be trusted.

Population substructure and the theta correction

Human populations are not randomly mating globally. Geographic barriers, cultural practices, and shared ancestry mean that allele frequencies in a Punjabi population differ from those in a Tamil population, which differ from those in a Bengali population. The same is true within nominally single ethnic groups in the United States (African American allele frequencies differ between northern and southern states), in Australia (Aboriginal Australian allele frequencies differ substantially between regional groups), and in every other country where systematic studies have been done. This internal variation is called population substructure or genetic stratification.

Substructure matters for forensic statistics because the product rule, which multiplies per-locus match probabilities to obtain an overall profile frequency, assumes that the loci are independent and that the sampled population is in Hardy-Weinberg equilibrium. When substructure is present, members of a subpopulation share alleles more often than the product rule predicts, because they share recent common ancestors. The result is that the product rule overstates how rare a profile is within the suspect's subpopulation, making the evidence appear stronger than it is.
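The product rule itself is simple arithmetic, and a short Python sketch makes the independence assumption visible. The allele frequencies below are hypothetical and the function names are illustrative only; no theta correction is applied here.

```python
from math import prod

def genotype_frequency(p, q=None):
    """Hardy-Weinberg genotype frequency: p^2 for a homozygote, 2pq for a heterozygote."""
    return p * p if q is None else 2 * p * q

def profile_frequency(loci):
    """Multiply per-locus genotype frequencies, treating loci as independent."""
    return prod(genotype_frequency(*alleles) for alleles in loci)

# Hypothetical three-locus profile: (p,) is a homozygote, (p, q) a heterozygote.
profile = [(0.10,), (0.10, 0.05), (0.20, 0.15)]
freq = profile_frequency(profile)
print(f"profile frequency = {freq:.2e}, about 1 in {1 / freq:,.0f}")
```

Because every locus contributes a multiplicative factor, a small per-locus underestimate caused by substructure compounds across a 20-locus profile.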

The theta correction, formalised by Balding and Nichols in 1994, adjusts for this effect. For a homozygous genotype at a single locus, the uncorrected frequency is p squared (where p is the allele frequency in the database). The theta-corrected frequency is [2theta + (1-theta)p][3theta + (1-theta)p] divided by [(1+theta)(1+2theta)]. For a heterozygous genotype with allele frequencies p and q, the uncorrected frequency is 2pq and the corrected frequency is 2[theta + (1-theta)p][theta + (1-theta)q] divided by the same denominator. In practice, most forensic laboratories apply theta values between 0.01 and 0.03 for within-country subpopulations and up to 0.05 for indigenous or highly isolated groups. The correction is conservative: it makes the match probability higher (less incriminating) than the uncorrected figure.

Genotype	Without theta correction	With theta = 0.01	With theta = 0.03
Homozygous genotype (p = 0.10)	p² = 0.010	~0.015	~0.027
Heterozygous genotype (p = 0.10, q = 0.05)	2pq = 0.010	~0.013	~0.018
Profile frequency (10 loci)	product of locus frequencies	higher at each locus	higher still at each locus
Direction of effect	n/a	more conservative	more conservative
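The Balding-Nichols adjustment is easy to check numerically. The sketch below implements the homozygote formula given above together with the standard heterozygote form, 2[theta + (1-theta)p][theta + (1-theta)q] over the same denominator; the function names are illustrative, not taken from any forensic software package.

```python
def bn_homozygote(p, theta):
    """Theta-corrected frequency of a homozygous genotype (Balding-Nichols)."""
    numerator = (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p)
    return numerator / ((1 + theta) * (1 + 2 * theta))

def bn_heterozygote(p, q, theta):
    """Theta-corrected frequency of a heterozygous genotype (Balding-Nichols)."""
    numerator = 2 * (theta + (1 - theta) * p) * (theta + (1 - theta) * q)
    return numerator / ((1 + theta) * (1 + 2 * theta))

# With theta = 0 both functions reduce to the uncorrected p^2 and 2pq.
for theta in (0.0, 0.01, 0.03):
    hom = bn_homozygote(0.10, theta)
    het = bn_heterozygote(0.10, 0.05, theta)
    print(f"theta = {theta:.2f}: homozygote {hom:.4f}, heterozygote {het:.4f}")
```

Running the loop shows the per-locus frequency rising with theta for both genotype types, which is the conservative direction.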

The UK Forensic Science Service adopted theta = 0.03 as a standard correction for DNA evidence presented in English and Welsh courts. The FBI applies theta = 0.01 as a default for CODIS loci, with higher values available for specific groups. Indian forensic practice under the DNA Technology (Use and Application) Regulation Bill 2019 (pending enactment as of 2026) proposes separate databases for major ethnic and tribal groups to reduce reliance on a single correction factor.

DNA reference databases: construction and examples

The largest and most extensively validated forensic reference databases are for DNA STR profiles. The US population data for the CODIS loci published by Steffen et al. in 2017 cover four US population groups: European American, African American, US Hispanic, and Asian American. The groups range from 97 to 361 individuals (1,036 in total), typed at 29 loci including the 20 core CODIS loci. The UK National DNA Database (NDNAD) uses population frequency tables maintained by the Forensic Science Regulator. Germany, the Netherlands, France, and other EU member states maintain national STR frequency databases published in peer-reviewed literature, with the Caucasian European population best represented.

In South Asia, the forensic DNA database infrastructure is less developed. Published STR frequency data exists for several Indian regional populations, including studies on Punjabi, Tamil, Bengali, and Rajasthani groups, but no single authoritative national database exists for forensic use. The Central Forensic Science Laboratory (CFSL) in India uses internally compiled frequency tables that are not uniformly published. This creates a transparency problem: a defence lawyer challenging a DNA statistic in an Indian court cannot easily access the underlying database to verify the calculation. The DNA Technology Bill, if enacted, would require the National DNA Data Bank to maintain and publish population data, moving India closer to the transparency standard already operating in the UK and US.

Hardy-Weinberg equilibrium testing and linkage disequilibrium testing are the primary validation tools for STR databases. If a database passes HWE tests at each locus and shows no significant linkage disequilibrium between loci, the product rule can be applied without further adjustment beyond the standard theta correction. The Steffen et al. CODIS dataset and the major European national databases pass these tests at the locus level, supporting the standard reporting approach. HWE departures at multiple loci indicate sample quality problems or significant unaccounted substructure, and a database showing them should not be used without further investigation.
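A per-locus HWE check can be sketched as a chi-square goodness-of-fit statistic comparing observed genotype counts with the counts expected from the allele frequencies. This is a simplified illustration with made-up counts and a hypothetical function name; for multi-allelic STR loci with sparse genotype classes, an exact test is generally more appropriate than this chi-square approximation.

```python
from collections import Counter
from itertools import combinations_with_replacement

def hwe_chi_square(genotype_counts):
    """Chi-square statistic against HWE expectations at one locus.

    genotype_counts maps a sorted allele pair, e.g. (12, 14), to a count of
    individuals. Returns (statistic, degrees_of_freedom).
    """
    n = sum(genotype_counts.values())
    allele_counts = Counter()
    for (a, b), count in genotype_counts.items():
        allele_counts[a] += count  # a homozygote adds its count twice
        allele_counts[b] += count
    freqs = {allele: c / (2 * n) for allele, c in allele_counts.items()}
    alleles = sorted(freqs)
    statistic = 0.0
    for a, b in combinations_with_replacement(alleles, 2):
        expected = n * (freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b])
        observed = genotype_counts.get((a, b), 0)
        statistic += (observed - expected) ** 2 / expected
    k = len(alleles)
    return statistic, k * (k - 1) // 2

# Hypothetical two-allele locus in perfect HWE proportions (25 : 50 : 25).
balanced = {(12, 12): 25, (12, 14): 50, (14, 14): 25}
print(hwe_chi_square(balanced))

# Same allele frequencies but no heterozygotes: a strong departure.
no_hets = {(12, 12): 50, (14, 14): 50}
print(hwe_chi_square(no_hets))
```

A heterozygote deficit like the second case is the pattern expected when two genetically distinct subgroups have been pooled into one database.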

Reference databases for non-DNA forensic evidence

The same principles that govern DNA databases apply to reference collections for other forensic evidence types, but the databases are generally smaller, less rigorously validated, and more jurisdiction-specific. Glass refractive index databases are a well-studied example. The refractive index (RI) of a glass fragment measures how strongly the glass bends light, and it varies systematically with glass type and manufacturer. A reference database of RI values from known glass sources allows the scientist to calculate how common the measured RI is in the target population of glass likely to appear in that jurisdiction.
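One simple way to express "how common" a measured RI is, assumed here purely for illustration, is the fraction of database measurements falling within a match window around the questioned value. The RI values, the window width, and the function name below are all hypothetical; casework methods typically use more refined density estimates and likelihood ratios.

```python
def ri_relative_frequency(measured_ri, database_ri, window=0.0001):
    """Fraction of database RI values within +/- window of the measured RI."""
    if not database_ri:
        raise ValueError("database is empty")
    matches = sum(1 for ri in database_ri if abs(ri - measured_ri) <= window)
    return matches / len(database_ri)

# Hypothetical reference measurements (not real survey data).
database = [1.5158, 1.5160, 1.5172, 1.51815, 1.51820,
            1.51826, 1.5190, 1.5205, 1.5221, 1.5230]
frequency = ri_relative_frequency(1.5182, database)
print(f"relative frequency within the match window: {frequency:.2f}")
```

The result depends entirely on which glass sources are in the database, which is why a collection built in one jurisdiction or period cannot be assumed to transfer to another.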

The UK Forensic Science Service built a glass RI database from windscreen and building glass samples collected over many years, which underpins the likelihood ratio calculations reported in UK glass evidence. The database reflects UK glass manufacturers and imports prevalent at the time of collection: its relevance to a case involving glass from a different country or a different time period requires separate justification. Similar databases exist in Australia (maintained by the Australian Federal Police forensic laboratory) and in several European countries, but they are not directly interchangeable because glass composition varies with manufacturer and national building standards.

Fingerprint frequency databases are a more contested area. Traditional fingerprint examination uses categorical classification (arches, loops, whorls) and point-counting, neither of which directly produces a probabilistic frequency statement from a database. Statistical approaches to fingerprint frequency, including the models developed by Srihari et al. and by Neumann et al., use large databases of known prints to estimate the probability of observing a given minutiae configuration. These databases are maintained by institutions such as the FBI and the Netherlands Forensic Institute. Courts in the US and UK have been cautious about full probabilistic reporting of fingerprint evidence, partly because the underlying databases and models have not been as thoroughly peer-reviewed as DNA databases.

Evidence type	Database basis	Typical size	Validation status
DNA STR profiles	Allele frequencies by population group	Hundreds to thousands per group	HWE+LD tested; peer-reviewed; widely published
Glass refractive index	RI measurements from known glass sources	Thousands of measurements	Jurisdiction-specific; validated within lab
Fingerprint minutiae	Large sets of known prints with minutiae coordinates	Hundreds of thousands of prints	Emerging; not uniformly peer-reviewed for court
Ink and paint	Spectral profiles from manufactured products	Thousands of product samples	Manufacturer-dependent; limited inter-lab validation

Consequences of a mismatched reference population

Using a database that does not represent the suspect's population can produce a match probability that is either too low or too high, depending on the direction of the frequency difference. If the suspect belongs to a population in which the relevant alleles are more common than in the database, the calculated match probability understates the true probability of a coincidental match, making the evidence appear more incriminating than the data supports. If the suspect belongs to a population in which the alleles are rarer, the database will produce an overly conservative probability, slightly disadvantaging the prosecution.

The typical magnitude of the error depends on the genetic distance between the database population and the suspect's actual population. For closely related groups, such as two Western European populations, the difference in reported match probabilities is usually less than one order of magnitude. For more distant groups, such as comparing a Western European database to an indigenous Amazonian population, differences of several orders of magnitude have been documented. In practice, the worst-case mismatches arise when a database built from one continental population is used to evaluate evidence from a suspect drawn from a different continent.
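The size of a mismatch effect can be expressed as a shift in orders of magnitude by evaluating the same profile against two databases. The sketch below uses invented locus names and allele frequencies for two hypothetical databases and the uncorrected product rule; it illustrates the arithmetic only and does not reproduce any documented comparison.

```python
import math

def product_rule_frequency(profile, database):
    """Uncorrected profile frequency under one database's allele frequencies."""
    freq = 1.0
    for locus, (a, b) in profile.items():
        p, q = database[locus][a], database[locus][b]
        freq *= p * p if a == b else 2 * p * q
    return freq

# Hypothetical two-locus profile and two hypothetical reference databases.
profile = {"L1": (12, 12), "L2": (15, 17)}
database_a = {"L1": {12: 0.10}, "L2": {15: 0.20, 17: 0.10}}
database_b = {"L1": {12: 0.30}, "L2": {15: 0.25, 17: 0.20}}

freq_a = product_rule_frequency(profile, database_a)
freq_b = product_rule_frequency(profile, database_b)
shift = math.log10(freq_b / freq_a)
print(f"database A: {freq_a:.2e}, database B: {freq_b:.2e}")
print(f"shift: {shift:.2f} orders of magnitude")
```

Even two loci with moderately different frequencies move the result by more than one order of magnitude here; across a full profile the per-locus differences multiply.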

Courts in England and Wales addressed this issue in R v Doheny and Adams (1997), which established that the prosecution must specify which database was used and must justify its choice. The US Supreme Court's decisions related to DNA evidence have not directly mandated database choice, but SWGDAM guidelines require laboratories to document and justify their population database selection. Under the Bharatiya Sakshya Adhiniyam 2023 (which replaced the Indian Evidence Act 1872), expert opinion evidence is admissible when the expert is qualified and the basis for the opinion is disclosed: this framework supports, though does not yet specifically require, disclosure of population database choices in DNA cases.

Database validation, maintenance, and expansion

Validation of a forensic reference database involves three categories of checks. First, internal consistency checks confirm that the database's own data is coherent: HWE tests at each locus, linkage disequilibrium tests between loci, and quality control checks on individual sample typing. Second, external comparison tests confirm that the frequency estimates agree with published data from independent studies of the same population. If a newly built Japanese STR database produces allele frequencies that differ substantially from previously published Japanese studies, that is a signal of either sampling bias or typing error. Third, proficiency testing assesses whether analysts using the database produce consistent results: the same evidence profile typed in two different laboratories against the same database should produce the same match probability.
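The external comparison step can be sketched as a per-allele comparison between a new database and a published one. The example below uses a two-proportion z statistic with an arbitrary cut-off of 3; the counts, the cut-off, and the function names are assumptions for illustration, and a real validation would normally use exact tests with a correction for multiple comparisons.

```python
import math

def two_proportion_z(count_a, n_a, count_b, n_b):
    """z statistic for the difference between two allele proportions."""
    pooled = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (count_a / n_a - count_b / n_b) / se if se > 0 else 0.0

def flag_discrepant_alleles(counts_new, n_new, counts_pub, n_pub, z_cut=3.0):
    """Return alleles whose frequencies differ by at least z_cut standard errors.

    counts_* map allele -> number of observations; n_* are total allele counts.
    """
    flagged = {}
    for allele in sorted(set(counts_new) | set(counts_pub)):
        z = two_proportion_z(counts_new.get(allele, 0), n_new,
                             counts_pub.get(allele, 0), n_pub)
        if abs(z) >= z_cut:
            flagged[allele] = z
    return flagged

# Hypothetical counts at one locus: the new database agrees with the published one.
agree = flag_discrepant_alleles({12: 40, 13: 160}, 200, {12: 80, 13: 320}, 400)
print(agree)

# Hypothetical counts where allele 12 is far more common in the new database.
differ = flag_discrepant_alleles({12: 100, 13: 100}, 200, {12: 80, 13: 320}, 400)
print(differ)
```

Flagged alleles are a prompt to look for sampling bias or typing error, not proof of either.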

Databases require maintenance because populations change over time. Migration, admixture, and the passage of generations gradually shift allele frequencies. A database built from samples collected in 1995 may not accurately represent the frequency distribution in 2025, particularly in countries with high immigration rates. Most forensic science authorities require periodic review of database currency, though specific re-collection intervals are not universally mandated. The US frequency data were last substantially updated for the expanded 20-locus panel: the estimates were derived from samples collected in the 2000s and published in 2017.

Expansion of forensic profiling systems requires parallel database expansion. When the UK Forensic Science Service introduced the SGM Plus 10-locus system, new population frequency data had to be collected for the additional loci. When the US moved from 13 to 20 CODIS loci in 2017, the Steffen et al. study provided the new frequency estimates. In both cases, the transition required laboratories to manage a period in which some samples had been typed at the old locus set and others at the new set, and the statistical reporting had to be clearly labelled with which loci were included. The same transition challenge will arise in any jurisdiction that adds new loci to its national profiling system.

Worked example

Evaluating a DNA match probability across three reference databases

A forensic scientist obtains a full 20-locus STR profile from a crime scene bloodstain and finds it matches the suspect. The suspect is of North Indian heritage. Three candidate databases are available: US Caucasian, UK Caucasian, and a published North Indian STR dataset. This example works through which database to use and why.

The scientist must report a random match probability: the chance that an unrelated person drawn from the relevant population would have the same 20-locus profile. The choice of reference population determines the denominator of that statistic.

Identify the relevant population. The suspect is of North Indian heritage. The relevant reference population is therefore the North Indian population, not a US or UK Caucasian population. Allele frequencies at several CODIS loci differ by a factor of two or more between European and South Asian populations.
Assess the available databases. The US Caucasian database (Steffen et al. 2017, n = 361) does not represent the suspect's population. The UK Caucasian database has the same problem. The published North Indian STR dataset (e.g. Agrawal et al. 2013, n = 200) covers the relevant population but was typed at 15 loci, not 20.
Choose the most appropriate database and document the choice. The North Indian dataset is used for the 15 loci it covers. For the remaining 5 loci, the scientist uses the US Asian American database as the closest available proxy and documents this substitution explicitly in the case notes.
Apply the theta correction. For each locus, the scientist uses the Balding-Nichols formula with theta = 0.03 to produce a conservative per-locus frequency. The per-locus frequencies are multiplied using the product rule to give a profile frequency.
Check for minimum frequency floors. Two alleles at one locus are not present in the Indian database (n = 200). The scientist applies the minimum frequency floor of 5/(2 x 200) = 0.0125 for those alleles, rather than treating their absence as a frequency of zero.
Report transparently. The court report states: the random match probability in the North Indian population is approximately 1 in X, calculated using the Agrawal et al. 2013 database for 15 loci and the Steffen et al. US Asian American data for 5 loci, with theta = 0.03 applied throughout. The substitution for 5 loci introduces additional uncertainty; a more conservative estimate is also provided using theta = 0.05.
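The calculation steps above can be sketched in Python. The allele frequencies are hypothetical placeholders rather than values from Agrawal et al. or Steffen et al., only three loci are shown, and a single database size is assumed for the floor; in the worked example the 5 substituted loci would use the proxy database's own size. All function names are illustrative.

```python
from math import prod

def floored(freq, n_individuals):
    """Apply the 5/(2N) minimum allele frequency floor."""
    return max(freq, 5 / (2 * n_individuals))

def bn_genotype(p, q, theta):
    """Balding-Nichols genotype frequency; q is None for a homozygote."""
    denom = (1 + theta) * (1 + 2 * theta)
    if q is None:
        return (2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p) / denom
    return 2 * (theta + (1 - theta) * p) * (theta + (1 - theta) * q) / denom

def random_match_probability(loci, n_individuals, theta):
    """Floor each allele frequency, correct each locus, then apply the product rule."""
    per_locus = []
    for p, q in loci:
        p = floored(p, n_individuals)
        q = None if q is None else floored(q, n_individuals)
        per_locus.append(bn_genotype(p, q, theta))
    return prod(per_locus)

# Hypothetical loci: a homozygote, a heterozygote, and a locus where
# neither allele was observed in the database (frequency 0 before flooring).
loci = [(0.10, None), (0.10, 0.05), (0.0, 0.0)]
for theta in (0.03, 0.05):
    rmp = random_match_probability(loci, n_individuals=200, theta=theta)
    print(f"theta = {theta:.2f}: about 1 in {1 / rmp:,.0f}")
```

Reporting both theta values side by side, as in the final step, lets the court see how sensitive the statistic is to the substructure assumption.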

Check your understanding

A reference population database shows Hardy-Weinberg equilibrium departures at three of its fifteen loci. What does this most likely indicate?

Key Takeaways

A valid reference database must represent the right population, contain enough samples to estimate allele frequencies with acceptable uncertainty, and be built using reproducible, documented methods. Databases that fail on any of these criteria produce unreliable match probability statistics.
Population substructure causes allele-sharing within subgroups beyond what random mating predicts. The theta correction (Balding-Nichols formula, theta typically 0.01 to 0.03) adjusts for this by raising the match probability, giving defendants a more conservative estimate when the exact subpopulation is uncertain.
DNA STR databases are the most extensively validated forensic reference collections, with the US CODIS (Steffen et al. 2017) and the UK NDNAD frequency tables as leading examples. Reference databases for glass, fingerprints, and other evidence types are smaller and less uniformly validated.
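Using a mismatched reference population typically shifts the reported match probability by one to two orders of magnitude, and by more when the populations are genetically distant. Courts in England and Wales (post-R v Doheny and Adams) expect the prosecution to identify and justify the database used; in the US, SWGDAM guidelines require laboratories to document their database selection; and India's Bharatiya Sakshya Adhiniyam 2023 framework supports, but does not yet specifically require, such disclosure.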
Databases require ongoing validation, including HWE and linkage disequilibrium testing, external comparison against independent published data, and periodic review for currency as populations change over time. Profiling system expansions require parallel database expansion and a carefully managed transition period.

What is a reference population database in forensic science?

A reference population database is a curated collection of genetic profiles, physical measurements, or other feature frequencies drawn from a defined human population. Forensic analysts use these frequencies to calculate the probability that an unrelated person chosen at random from that population would share the observed evidence feature. The accuracy of the resulting statistic depends directly on how well the database population matches the suspect population.

Why does population substructure matter for forensic match probabilities?

Population substructure means that allele frequencies differ between subgroups within a broader population. If the database is built from one subgroup but the suspect belongs to another, the estimated match probability may be too low or too high. Forensic statisticians use theta correction (also called the coancestry coefficient) to adjust for expected substructure when the exact subpopulation is unknown.

How large does a forensic reference database need to be?

The minimum depends on the rarity of the feature being measured. For DNA profiles, most guidelines require at least 100 to 200 profiles per subpopulation to estimate single-locus allele frequencies reliably. For rarer features, larger samples reduce the uncertainty intervals around the frequency estimate. The NRC II report (1996) and SWGDAM guidelines both address minimum sample size for DNA databases.

What is the effect of using a mismatched reference population?

Using a mismatched reference population typically changes the estimated match probability by one to two orders of magnitude, though the direction depends on whether the suspect's population has higher or lower allele frequencies than the database. Courts in England and Wales, the United States, and elsewhere have examined this issue. Most forensic standards require analysts to report which database was used and to justify the choice.

Are the same population databases used for all forensic evidence types?

No. DNA, fingerprint frequency studies, glass refractive index databases, ink databases, and fibre colour databases each have their own reference collections built from appropriate sample sources. DNA databases are the most extensively validated; databases for other feature types are smaller and often more jurisdiction-specific. The principles of stratification and size adequacy apply across all types, but the specific guidance differs by discipline.
