Statistical Comparison and the Reference Population Problem

How forensic geologists decide whether two soil samples share a common source, using classical statistics and likelihood ratios, and why the reference population is the hardest part of that judgment.

Last updated: 19 Jun 2026

In forensic geology, deciding whether two soil samples share a common source requires a statistical framework built around a reference population: a collection of soils from the broader area that represents what unrelated sources look like. The comparison proceeds in two stages. First, multivariate methods such as principal component analysis (PCA) and linear discriminant analysis (LDA) reduce high-dimensional geochemical data to a form that shows whether the questioned sample clusters with the scene sample. Second, a likelihood ratio (LR) quantifies how much more probable that degree of similarity is under a same-source hypothesis than under a different-source hypothesis. The reference population defines the denominator of that LR, and its design (how broadly it is drawn, how it is sampled, and how faithfully it reflects local geological diversity) determines the reliability of the entire evaluation.

Two soil samples, one from a suspect's boot, one from a crime scene, pose the central question of forensic soil comparison: could these have come from the same patch of ground? Translating that question into a number that holds up in court requires statistics, and the reliability of those statistics depends entirely on the reference population built to surround them.

This is the core methodological challenge in forensic geology. Soil composition varies at the metre scale, which is its strength as forensic evidence. But that same variability means that deciding what counts as a meaningful match, and what counts as coincidence, requires a careful statistical framework. The field has moved from simple profile comparisons and visual colour matching toward multivariate analysis and likelihood ratios, but the reference population problem runs through all of it.

This topic works through the main statistical tools used in soil evidence: principal component analysis and linear discriminant analysis for making sense of multi-element geochemical data, Bayesian likelihood ratios for expressing the weight of a match, and the honest accounting of false-positive rates. It pays particular attention to the reference population question, because that is where most of the real scientific debate happens and where opposing experts tend to fight hardest.

By the end of this topic you will be able to:

Define the reference population concept and explain how its scope and sampling strategy affect the denominator of the likelihood ratio and therefore the weight given to a match.
Apply PCA and LDA to multi-element geochemical soil data, distinguishing their supervised/unsupervised character and their appropriate uses in casework.
Calculate or interpret a multivariate likelihood ratio for a soil comparison, and correctly identify the prosecutor's fallacy when an LR is misrepresented as a posterior probability.
Evaluate false-positive and false-negative rates in soil comparison studies and explain how geological uniformity, method choice, and sampling timing affect those rates.
Design a reference population sampling strategy for a given case geography and identify the limitations that must be disclosed in the expert report.

Key terms

Reference population: The set of soils sampled from the broader area to represent what 'other possible sources' look like. Its composition determines the denominator of the likelihood ratio and therefore the strength of any match conclusion.
Principal Component Analysis (PCA): A technique that reduces a high-dimensional data set (e.g., twenty elemental concentrations per sample) to a smaller set of uncorrelated components that retain most of the variance, making clustering and separation visible.
Linear Discriminant Analysis (LDA): A supervised classification method that finds the linear combination of features that best separates known groups. In forensic geology it is used to classify a questioned sample into one of several source areas.
Likelihood Ratio (LR): The ratio of the probability of the observed data given a prosecution hypothesis (same source) to the probability given a defence hypothesis (different source). The LR quantifies the evidential weight without encoding a prior probability of guilt.
Mahalanobis distance: A multivariate distance measure that accounts for correlations between variables and scales each dimension by its variance, allowing comparison of data points in a consistent way regardless of measurement units.
False-positive rate: The proportion of cases in which two samples from genuinely different sources are incorrectly classified as sharing a source. Depends on the statistical method used, the reference population, and the geological diversity of the region.

The comparison question and why it is hard

Forensic comparison always asks a two-part question: are these samples similar, and if so, is that similarity unusual or commonplace? The first part is chemistry and mineralogy. The second part is statistics, and it depends entirely on knowing how the sample compares to what else is out there.

Soil makes the second question genuinely hard. Geochemical composition can vary dramatically over a few metres where a soil boundary crosses a geological contact, a drainage channel, or a fill deposit. That variability gives soil its discriminating power: two samples taken from different locations can often be distinguished. But it also means the reference population that represents 'anywhere else' must be carefully defined and sampled. A poorly designed reference collection can either inflate or deflate the apparent rarity of a match.

Published studies have examined how misclassification rates change with reference-population design. Morgan and Pringle (2012) tested soil discrimination across a range of English landscapes and found that misclassification rates ranged from under 5% in geologically diverse areas to over 20% in geologically uniform ones. Pye and Blott have demonstrated similar variation across different analytical protocols. These are not theoretical concerns. They translate directly into the strength of a court opinion.

PCA and LDA for multi-element soil data

A modern soil analysis by inductively coupled plasma mass spectrometry (ICP-MS) can measure thirty or more elements in a single sample. Each element is a dimension, and a case involving a questioned sample and a hundred reference samples is a cloud of points in thirty-dimensional space. No human can visualise that. PCA is the standard tool for collapsing it into something interpretable.

PCA computes new axes, called principal components, that are linear combinations of the original variables and are ordered by how much variance they capture. The first principal component might account for 45% of the total variation in a geochemical dataset, the second another 20%, and so on. By plotting samples along the first two or three components, an analyst can see whether the questioned sample clusters with the reference samples from the suspect's alleged location or sits apart.

Figure: PCA score plot showing two soil source groups (A and B) and a questioned sample. PC1 separates the groups on a geology-controlled axis; the questioned sample falls within the suspected source cluster (group A), not group B.
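The projection described above can be sketched in a few lines. This is a minimal illustration, not a casework implementation: the three-element concentrations are invented, the data are standardised before analysis (an assumption; some protocols use log-ratio transforms instead), and only the first principal component is extracted, using power iteration so that the example needs nothing beyond the Python standard library.

```python
import math

# Hypothetical concentrations (mg/kg) of three elements; values are illustrative only.
group_a = [[12.0, 30.0, 5.1], [12.5, 31.0, 4.9], [11.8, 29.5, 5.0]]
group_b = [[8.0, 20.0, 5.2], [8.3, 21.0, 4.8], [7.9, 19.5, 5.0]]
questioned = [12.2, 30.4, 5.0]
reference = group_a + group_b

n, p = len(reference), len(reference[0])
means = [sum(row[j] for row in reference) / n for j in range(p)]
sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in reference) / (n - 1))
       for j in range(p)]

def standardise(row):
    """Centre and scale one sample using the reference means and standard deviations."""
    return [(row[j] - means[j]) / sds[j] for j in range(p)]

z = [standardise(row) for row in reference]
# Correlation matrix of the reference data (covariance of the standardised values).
corr = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)] for i in range(p)]

def first_component(matrix, iterations=500):
    """Power iteration: largest eigenvalue of the matrix and its unit eigenvector."""
    v = [1.0 / math.sqrt(len(matrix))] * len(matrix)
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(a * b for a, b in zip(row, v))
                     for vi, row in zip(v, matrix))
    return eigenvalue, v

eigenvalue, loadings = first_component(corr)
explained = eigenvalue / p  # the trace of a correlation matrix equals p

def pc1_score(row):
    """Project a sample onto the first principal component."""
    return sum(l * s for l, s in zip(loadings, standardise(row)))

scores_a = [pc1_score(r) for r in group_a]
scores_b = [pc1_score(r) for r in group_b]
score_q = pc1_score(questioned)
print(f"PC1 explains {explained:.0%} of the variance")
print("group A:", [round(s, 2) for s in scores_a])
print("group B:", [round(s, 2) for s in scores_b])
print("questioned:", round(score_q, 2))
```

Note that the questioned sample is projected using loadings and scaling derived from the reference samples only, so its own values do not influence the axes it is plotted on.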

LDA goes one step further. Where PCA is unsupervised (it does not know which samples belong to which group), LDA is trained on labelled samples to find the axis that maximally separates known classes. If an analyst has well-characterised samples from several distinct source areas and wants to classify a questioned sample into one of them, LDA provides a principled way to make that assignment along with a posterior probability.
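As a rough sketch of the two-class case, the discriminant direction can be computed directly from the group means and the pooled covariance. The two-element measurements below are invented, the priors are assumed equal, and the posterior relies on the usual LDA assumption that both groups are Gaussian with a shared covariance matrix. Real casework would involve more elements, more source areas, and validation against held-out samples.

```python
import math

# Hypothetical two-element measurements for two known source areas (illustrative values).
source_a = [(12.0, 30.0), (13.0, 29.0), (11.0, 31.0), (12.0, 32.0)]
source_b = [(10.0, 28.0), (11.0, 27.0), (9.0, 29.0), (10.0, 30.0)]
questioned = (11.2, 29.6)

def mean(points):
    return tuple(sum(p[i] for p in points) / len(points) for i in range(2))

def scatter(points, m):
    """Sums of squares and cross-products about the group mean: (s11, s12, s22)."""
    s11 = sum((p[0] - m[0]) ** 2 for p in points)
    s12 = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    s22 = sum((p[1] - m[1]) ** 2 for p in points)
    return s11, s12, s22

m_a, m_b = mean(source_a), mean(source_b)
sa, sb = scatter(source_a, m_a), scatter(source_b, m_b)
dof = len(source_a) + len(source_b) - 2
c11, c12, c22 = ((x + y) / dof for x, y in zip(sa, sb))  # pooled covariance

# w = inverse(pooled covariance) * (m_a - m_b), written out for the 2x2 case.
det = c11 * c22 - c12 * c12
d0, d1 = m_a[0] - m_b[0], m_a[1] - m_b[1]
w = ((c22 * d0 - c12 * d1) / det, (-c12 * d0 + c11 * d1) / det)
midpoint = ((m_a[0] + m_b[0]) / 2, (m_a[1] + m_b[1]) / 2)

def log_odds_a(x):
    """Log posterior odds for source A versus source B, assuming equal priors."""
    return w[0] * (x[0] - midpoint[0]) + w[1] * (x[1] - midpoint[1])

def posterior_a(x):
    """Posterior probability of source A via a numerically stable logistic function."""
    t = log_odds_a(x)
    return 1 / (1 + math.exp(-t)) if t >= 0 else math.exp(t) / (1 + math.exp(t))

assigned = "A" if log_odds_a(questioned) > 0 else "B"
print("discriminant weights:", w)
print("assigned source:", assigned)
print("posterior for A:", round(posterior_a(questioned), 3))
```

The posterior here is conditional on the questioned sample having come from one of the two listed source areas. It says nothing about sources that were never sampled, which is why LDA output is not a substitute for a likelihood ratio built on a reference population.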

Method	Supervised?	Output	Best use in casework
PCA	No	Score plot, explained variance	Visualising clustering, flagging outliers
LDA	Yes	Class assignment, posterior probability	Classifying questioned sample into defined source areas
Mahalanobis distance	No (uses group stats)	Distance from group centroid	Flagging whether a sample is within group scatter
Likelihood ratio (multivariate)	No	Numerical weight of evidence	Formal court-ready evaluation of the match strength
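The Mahalanobis distance row in the table can be made concrete with a small sketch. The group below is invented and has only two elements so that the covariance matrix can be inverted in closed form. The two test points are the same Euclidean distance from the centroid, but one lies along the direction in which the group naturally varies and the other lies across it.

```python
import math

# Hypothetical two-element measurements from one reference group (illustrative values).
group = [(12.0, 30.0), (13.0, 29.0), (11.0, 31.0), (12.0, 32.0)]

n = len(group)
m0 = sum(p[0] for p in group) / n
m1 = sum(p[1] for p in group) / n
# Sample covariance matrix entries (divisor n - 1).
c00 = sum((p[0] - m0) ** 2 for p in group) / (n - 1)
c01 = sum((p[0] - m0) * (p[1] - m1) for p in group) / (n - 1)
c11 = sum((p[1] - m1) ** 2 for p in group) / (n - 1)
det = c00 * c11 - c01 * c01

def mahalanobis(x):
    """Distance of x from the group centroid, scaled by the group covariance."""
    d0, d1 = x[0] - m0, x[1] - m1
    # Quadratic form d^T C^{-1} d, using the closed-form inverse of a 2x2 matrix.
    q = (c11 * d0 * d0 - 2 * c01 * d0 * d1 + c00 * d1 * d1) / det
    return math.sqrt(q)

along = (13.0, 29.5)    # offset (+1, -1): follows the group's negative correlation
against = (13.0, 31.5)  # offset (+1, +1): cuts across it
print("along the scatter:", round(mahalanobis(along), 2))
print("against the scatter:", round(mahalanobis(against), 2))
```

Any cut-off applied to this distance is a casework decision that needs its own validation; the sketch only shows why correlated elements make Euclidean distance misleading.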

Bayesian source attribution and the likelihood ratio

The likelihood ratio is now the preferred reporting framework in forensic science across multiple disciplines, including soil comparison. Its appeal is that it makes explicit what question the science can answer and what question it cannot. The scientist answers: given the observed geochemical and mineralogical data, how much more probable is this degree of similarity under the hypothesis that both samples came from the same location than under the hypothesis that they came from different locations? The court, not the scientist, then weighs that number against the rest of the evidence.

Formally, the LR is the ratio of two probabilities. The numerator is the probability of the observed data if the prosecution's hypothesis is correct (same source). The denominator is the probability of the observed data if the defence hypothesis is correct (different source). The denominator requires sampling the reference population to estimate how often soils from unrelated locations look as similar as the two in question. This is exactly where the reference population design matters most.

Figure: Structure of the likelihood ratio for a geochemical soil match. The numerator and denominator each require a separate probability estimate; the denominator depends on the reference population.
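To show how the numerator and denominator fit together, here is a deliberately simplified, single-element sketch. It assumes a two-level normal model: source means vary across the reference population with standard deviation sigma_b, and measurements vary within a source with standard deviation sigma_w. All values are invented, the function and variable names are hypothetical, and the sample standard deviation of the reference values is used as a stand-in for sigma_b. A real evaluation would be multivariate and would carry the uncertainty in these estimates.

```python
import math
import statistics

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def likelihood_ratio(scene, questioned, mu, sigma_b, sigma_w):
    """LR for one element under an assumed two-level normal model."""
    var = sigma_b ** 2 + sigma_w ** 2   # variance of any single measurement
    cov = sigma_b ** 2                  # shared-source covariance under the same-source hypothesis
    det = var * var - cov * cov
    dx, dy = scene - mu, questioned - mu
    # Numerator: joint density of both measurements if they share one source mean.
    q = (var * dx * dx - 2 * cov * dx * dy + var * dy * dy) / det
    numerator = math.exp(-q / 2) / (2 * math.pi * math.sqrt(det))
    # Denominator: product of the marginal densities if the sources are unrelated.
    denominator = normal_pdf(scene, mu, var) * normal_pdf(questioned, mu, var)
    return numerator / denominator

# Hypothetical reference population values for one element (mg/kg).
reference_values = [8.0, 9.5, 11.0, 12.5, 14.0, 10.2, 9.1, 13.3]
mu = statistics.mean(reference_values)
sigma_b = statistics.stdev(reference_values)  # assumption: used as between-source spread
sigma_w = 0.3                                 # assumption: from replicate measurements

lr_same = likelihood_ratio(12.0, 12.1, mu, sigma_b, sigma_w)
lr_diff = likelihood_ratio(12.0, 9.0, mu, sigma_b, sigma_w)
lr_wide = likelihood_ratio(12.0, 12.1, mu, 2 * sigma_b, sigma_w)
print("close pair:", round(lr_same, 1))
print("distant pair:", lr_diff)
print("close pair, more diverse reference population:", round(lr_wide, 1))
```

The last line illustrates the reference population problem directly: the same pair of measurements yields a larger LR when the reference population is more diverse, so the choice of population changes the reported weight of evidence.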

LR values in soil cases reported in the literature range widely, from modest values around 10-100 in geologically uniform areas to values exceeding 100,000 in cases where the questioned soil contains a highly distinctive mineralogical assemblage. The ENFSI verbal equivalence scale (discussed in the next topic) provides a way to communicate these magnitudes in plain language without misrepresenting the precision of the estimate.

Defining the reference population

There is no universal rule for how large a reference population must be or how it should be sampled. The answer depends on the case geography, the geological diversity of the area, and the analytical methods used. A case where the questioned site is in the middle of a geologically uniform glacial plain needs a different approach from one where the site sits on a narrow band of serpentinite surrounded by quite different geology.

Define the relevant area: Starting from the known facts of the case, identify the geographic area that represents plausible alternative sources. This might be a radius around the crime scene, a transit corridor, or an urban catchment area.
Sample systematically: A grid or random stratified sample across the defined area gives a reference collection that reflects the geological diversity within it. Opportunistic sampling biases the denominator toward accessible locations.
Analyse identically: Reference samples must be prepared and analysed using exactly the same protocol as the evidential samples. Differences in preparation introduce analytical variance that is indistinguishable from real geochemical differences.
Validate the distribution: Check that the reference collection captures the range and frequency distribution of the analytical parameters. If the questioned sample falls at the extreme of the reference distribution, the LR estimate in that region is uncertain.
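The systematic sampling step can be planned with a short script before anyone goes into the field. The sketch below is illustrative only: the coordinates are local easting and northing offsets in metres, the 500 m spacing and grid extent are hypothetical, and the function names are invented for this example. It produces both regular grid nodes and a stratified random alternative with one random point per grid cell.

```python
import random

def grid_nodes(centre_e, centre_n, spacing_m, half_width):
    """Regular grid of (easting, northing) nodes centred on the scene."""
    nodes = []
    for i in range(-half_width, half_width + 1):
        for j in range(-half_width, half_width + 1):
            nodes.append((centre_e + i * spacing_m, centre_n + j * spacing_m))
    return nodes

def stratified_points(nodes, spacing_m, seed=0):
    """One random point inside the cell around each node (stratified random sampling)."""
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced and disclosed
    half = spacing_m / 2
    return [(e + rng.uniform(-half, half), n + rng.uniform(-half, half))
            for e, n in nodes]

nodes = grid_nodes(0.0, 0.0, 500.0, 2)    # 5 x 5 nodes at a hypothetical 500 m spacing
points = stratified_points(nodes, 500.0)
print(len(nodes), "grid nodes; first stratified point:", points[0])
```

Recording the seed and the planned coordinates makes it easier to document in the report which locations could not be sampled because of access restrictions.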

The gap between what is ideal and what is achievable in casework is real. A forensic geologist may receive samples months after an incident, by which time seasonal changes have altered surface chemistry and the ability to sample the exact relevant area may be limited by access restrictions or development. The report must be honest about these limitations and what they mean for the precision of the LR estimate.

False-positive and false-negative rates in soil comparison

One of the Daubert factors for the admissibility of expert scientific testimony is the technique's known or potential error rate. For geological soil evidence, this means false-positive rates (two samples from different sources classified as matching) and false-negative rates (two samples from the same source classified as non-matching) must be estimated from empirical testing, not asserted from first principles.

Published validation studies provide benchmarks. Pye and Blott conducted blind trials comparing colour, particle size, and geochemical methods across diverse English landscapes and reported correct classification rates of 80-95% depending on method and geology. Morgan and Pringle used a known-source soil set from English agricultural and urban environments and found that multi-element ICP-MS data with LDA gave the lowest false-positive rates, typically below 10% in geologically diverse regions.

Geologically uniform areas (e.g., glacial till plains) produce higher false-positive rates because soils over a large area look chemically similar.
Geologically diverse areas (e.g., mineralised terrains, coastal margins with varied inputs) produce lower false-positive rates because even nearby sources differ.
Method matters: colour matching alone has higher error rates than combined geochemical and mineralogical profiling.
Seasonal and weather-related variation can shift surface chemistry, so samples from the same source collected at different times may look different, which raises apparent false-negative rates.
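Both error rates come from counting outcomes in a known-source validation set. The sketch below uses invented counts purely to show the arithmetic; each record pairs the ground truth (whether the two samples really share a source) with the analyst's call.

```python
def error_rates(results):
    """Return (false_positive_rate, false_negative_rate) from (same_source, called_match) pairs."""
    fp = sum(1 for same, match in results if not same and match)
    tn = sum(1 for same, match in results if not same and not match)
    fn = sum(1 for same, match in results if same and not match)
    tp = sum(1 for same, match in results if same and match)
    return fp / (fp + tn), fn / (fn + tp)

# Hypothetical validation outcomes: 40 different-source pairs and 20 same-source pairs.
results = (
    [(False, True)] * 3      # different sources, wrongly called a match
    + [(False, False)] * 37  # different sources, correctly distinguished
    + [(True, False)] * 2    # same source, wrongly called a non-match
    + [(True, True)] * 18    # same source, correctly matched
)
fpr, fnr = error_rates(results)
print(f"false-positive rate: {fpr:.3f}, false-negative rate: {fnr:.3f}")
```

Rates computed this way apply only to the geology, method, and sampling conditions of the validation set, so they should be reported alongside a description of that set.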

Presenting statistics in court

Statistical outputs from PCA, LDA, or LR calculations are not self-explanatory to a jury or a judge with no statistical training. The forensic geologist's job is to translate them honestly without either overstating the precision of the estimate or understating the weight of genuine evidence.

The LR framework helps because it separates the scientific assessment from the legal conclusion. The scientist says: the observed similarity is X times more probable if both samples came from the same location than if they came from different locations. The court then combines that with everything else it knows about the case. The scientist does not say 'these samples match' as though that settles the question, nor do they say 'it is probable that the suspect was at the scene,' which conflates the LR with a posterior probability.
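The difference between an LR and a posterior probability is easy to see numerically. Bayes' theorem in odds form says that posterior odds equal prior odds multiplied by the LR. The sketch below applies one LR to two different priors; the prior values are invented for illustration, and in practice the prior belongs to the court, not to the scientist.

```python
def posterior_probability(lr, prior_probability):
    """Convert a prior probability and a likelihood ratio into a posterior probability."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1 + posterior_odds)

lr = 1000
for prior in (0.5, 0.0001):  # hypothetical priors for the same-source proposition
    print(f"prior {prior}: posterior {posterior_probability(lr, prior):.4f}")
```

The same LR of 1000 leads to very different posteriors depending on the prior, which is why reading an LR as "the probability the samples share a source" is the prosecutor's fallacy.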

Worked example

Geochemical comparison in a hit-and-run investigation

Figure: From boot soil to LR, a soil comparison workflow from scene to report.

A pedestrian is struck and killed on a rural road. The suspect vehicle is found two weeks later. Soil adhering to the underbody is sampled. Police also sample the verge at the incident site and provide the analyst with a map of the surrounding area. The analyst needs to assess whether the vehicle soil could have originated from the scene verge.

Sample preparation: both the questioned (vehicle) soil and the scene (verge) soil are dried, disaggregated, and prepared identically for ICP-MS elemental analysis and particle-size distribution measurement.
Reference population sampling: the analyst arranges for twenty-four soil samples to be collected on a 500-metre grid centred on the scene. This captures the geological diversity of the catchment area the vehicle might plausibly have passed through.
PCA: all twenty-six samples (twenty-four reference, one scene, one questioned) are plotted in PCA space. The questioned sample clusters with the scene sample and the three nearest reference samples from the immediately surrounding area.
LR calculation: using the multivariate geochemical data and the twenty-four reference samples as the denominator population, the analyst calculates an LR of approximately 800. On the ENFSI verbal scale this falls in the 'moderately strong support' category.
Reporting: the report states the LR value and explains that the composition of the questioned soil is approximately 800 times more probable if it originated from the scene verge than if it originated from another location within the sampled reference area. The limitations are recorded: the reference population was sampled two weeks post-incident; no account could be taken of transport-related mixing.

The report does not say the vehicle was at the scene. It quantifies the soil evidence. That number, together with tyre-track evidence and witness testimony placing the vehicle in the area, builds the case. The statistics did their job by isolating what the soil alone can contribute.

Key Takeaways

The reference population defines the denominator of the likelihood ratio; designing it appropriately for the case geography is the central methodological challenge in forensic soil comparison.
PCA reduces multi-element geochemical data to a visualisable space for clustering analysis; LDA uses labelled training data to classify questioned samples into source areas.
The likelihood ratio framework separates the scientific weight of evidence from the legal conclusion, avoiding the prosecutor's fallacy that conflates LR with posterior probability of guilt.
Published error-rate studies (Morgan and Pringle; Pye and Blott) show that false-positive rates vary with geology, method, and reference population design, and must be reported honestly.
Multiple independent tests (colour, particle size, mineralogy, geochemistry) can be combined by multiplying their LRs, provided independence is demonstrated and not merely assumed.

What is the reference population problem in forensic soil comparison?

The reference population is the set of soils a questioned sample is compared against to gauge how unusual a match is. Defining it is the hardest step: too narrow and the denominator is unrealistically small, making the match look more exclusive than it is; too broad and you dilute real discriminating power. There is no universal rule, and analysts must justify their sampling strategy case by case.

Why is PCA used in forensic soil chemistry?

A soil sample measured for twenty or more elements produces a high-dimensional data point that is hard to visualise or compare directly. PCA reduces that space to a handful of principal components that capture most of the variation. Samples from the same source cluster together in the reduced space; samples from different sources sit apart. It is a way of summarising complex chemistry in a form a court can understand.

What does a likelihood ratio of 1000 mean in a soil case?

It means the observed level of similarity between the questioned and known samples is 1000 times more probable if they share a common source than if they do not. It is not the probability that the samples share a source. It is a weight of evidence that the court combines with other information, including non-forensic evidence, to reach a verdict.

Can false-positive rates be calculated for geochemical soil matches?

Yes, but they depend on the reference population chosen. Studies by Morgan and Pringle, and by Pye and Blott, have measured misclassification rates using known-source sample sets. Rates vary considerably with the methods used and the geological diversity of the comparison area, which is one reason the reference population definition is so contentious.

What is the difference between a classical discrimination approach and a likelihood ratio approach?

Classical discrimination asks whether a questioned sample falls within the distribution of a reference population using a fixed threshold, typically a p-value or a Mahalanobis distance cut-off. The likelihood ratio approach asks how much more probable the data are under a same-source hypothesis compared with a different-source hypothesis. The LR framework is preferred in forensic science because it separates the scientific assessment from the legal threshold and communicates uncertainty more honestly.
