Likelihood Ratios and Verbal Scales in Geological Evidence

Q: What is the numerator of the likelihood ratio in a soil comparison case?

The numerator is the probability of observing the measured degree of similarity between the questioned and reference samples if they truly share a common source. It is typically estimated from the within-source variation: how similar are multiple samples collected from the same location to each other? A tight within-source distribution means even a moderate degree of measured similarity supports the same-source hypothesis strongly.

Q: What does the ENFSI verbal equivalence scale say?

The European Network of Forensic Science Institutes verbal scale maps numerical LR ranges to plain-English phrases. LR values from 1 to 10 are described as 'limited support', 10-100 as 'moderate support', 100-1,000 as 'moderately strong support', 1,000-10,000 as 'strong support', 10,000-1,000,000 as 'very strong support', and above 1,000,000 as 'extremely strong support' for the same-source hypothesis.

Q: Can you multiply LRs from colour, particle size, and geochemistry together?

Only if they are truly independent measurements. Colour and particle size are influenced by different soil properties and are often largely independent. Geochemical elements within a single analytical suite may be correlated with each other, so individual element LRs from the same ICP-MS run should not be multiplied. PCA can be used to check correlation structure before combining.

Q: What does 'calibration' mean in the context of LRs for geological evidence?

Calibration means checking that the LR values an analyst reports actually track reality: when an analyst reports LR = 1,000 for a same-source pair, that event should indeed be about 1,000 times more common under the same-source hypothesis than under the different-source hypothesis. Calibration is assessed using a known-source validation set and tools such as the Tippett plot.

Q: How does verbal scale inconsistency arise in practice?

Studies have shown that different forensic scientists using the same ENFSI verbal scale apply different words to the same numerical LR, and that juries interpret the words differently depending on the surrounding context of the case. Some practitioners argue for reporting LR numbers directly rather than using verbal phrases to reduce this inconsistency.

How likelihood ratios are calculated and calibrated for soil and mineralogical comparisons, what the ENFSI verbal equivalence scale means in practice, and how multiple independent tests are combined.

Last updated: 19 Jun 2026

A likelihood ratio (LR) is the probability of the observed forensic evidence under the prosecution hypothesis divided by its probability under the defence hypothesis. For soil and mineral comparisons, the numerator is estimated from within-source variation and the denominator from between-source variation in a reference population. A value above 1 supports the same-source hypothesis; the ENFSI verbal equivalence scale translates the numerical result into standardised phrases such as "moderate support" or "strong support" for use in court reports. Results from multiple independent tests (colour, particle-size distribution, and geochemistry) can be combined by multiplying their individual LRs, but only after the correlation structure has been checked. Calibration, assessed through Tippett plots on known-source validation sets, confirms whether reported LR values correspond to actual frequency differences between the two hypotheses.

When a multivariate geochemical profile from a questioned sample appears to match a reference sample from a crime scene, the analyst faces a precise problem: translating that observed similarity into a number a court can use, then describing that number in words that a jury can interpret accurately.

The likelihood ratio is the analytical framework forensic science uses for this purpose, and the ENFSI verbal equivalence scale is the bridge between numbers and plain language. Neither is perfect. The LR requires a well-designed reference population and honest calibration. The verbal scale introduces its own inconsistencies when different practitioners use different words for the same number, or when jurors read the same words differently depending on the context.

This topic works through the LR calculation for a soil comparison from first principles, covers calibration and what a Tippett plot shows, describes the ENFSI scale and its documented limitations, and then tackles the practical question of combining LRs from multiple independent tests. A worked numerical example anchors the concepts to a real reporting scenario.

By the end of this topic you will be able to:

Calculate the numerator and denominator of a likelihood ratio for a soil comparison using within-source and between-source variation data.
Interpret a Tippett plot to assess whether an LR calculation method is well-calibrated.
Apply the ENFSI verbal equivalence scale correctly and identify its documented limitations regarding practitioner inconsistency and juror interpretation.
Determine when LRs from multiple forensic tests may be multiplied and when correlated measurements must be combined into a single multivariate LR instead.
Construct a complete LR report for a soil case, including the numerical value, the corresponding ENFSI phrase, reference population size, and independence justification.

Key terms

Numerator hypothesis: The prosecution hypothesis under which the LR numerator is calculated. In a source comparison it is typically: the questioned sample and the reference sample come from the same source location.
Denominator hypothesis: The defence hypothesis under which the LR denominator is calculated. Typically: the questioned sample comes from some other location within the relevant reference population.
Within-source variation: The natural variation observed among multiple samples collected from a single source location. Used to estimate the probability of the observed difference between questioned and reference samples under the same-source hypothesis.
Between-source variation: The variation observed among samples from different locations in the reference population. Used to estimate the probability of the observed similarity under the different-source hypothesis.
Tippett plot: A calibration diagnostic plot that shows the cumulative distributions of LR values calculated for known same-source pairs and known different-source pairs. Well-calibrated systems show the two curves diverging strongly; overlapping curves indicate poor discrimination.
ENFSI verbal scale: A recommended scale published by the European Network of Forensic Science Institutes that maps ranges of LR values to standardised verbal phrases such as 'moderate support', 'strong support', and 'extremely strong support' for either the prosecution or defence hypothesis.

The LR framework from first principles

The likelihood ratio is simply the probability of the observed evidence given the prosecution hypothesis divided by the probability of the observed evidence given the defence hypothesis. Written out, that is P(E | Hp) / P(E | Hd). Each term requires a different probability estimate, and those estimates come from different parts of the data.

For a soil geochemical comparison, P(E | Hp), the probability of seeing this degree of similarity given the two samples come from the same source, is estimated from within-source variation. The analyst collects replicate samples from the reference location (or uses the spread around a centroid in PCA space) to characterise how much variation is normal within a single source. If the questioned sample falls well within that spread, the numerator is high.

P(E | Hd), the probability of seeing this degree of similarity given the samples come from different sources, is estimated from between-source variation in the reference population. The analyst asks: among all the different-source pairs I can form from my reference collection, how often do two soils from different locations look as similar as the two in this case? If rare, the denominator is low, and the LR is large.
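The question "how often do two soils from different locations look this similar?" can be sketched as a simple pair count. The snippet below is a minimal illustration, assuming each location is summarised by a single score (for example a first principal-component score). The site names, scores, and the observed difference are hypothetical, not data from any case. The result is an empirical frequency, not a full density-based denominator, which would need a fitted distribution or kernel density estimate.

```python
from itertools import combinations

# Hypothetical reference collection: one summary score per location.
# These values are invented for illustration only.
reference_scores = {
    "site_A": 0.12, "site_B": 0.95, "site_C": 1.40,
    "site_D": 2.10, "site_E": 0.20, "site_F": 3.05,
}

def different_source_match_rate(scores, observed_difference):
    """Fraction of different-source pairs at least as similar as the case pair."""
    pairs = list(combinations(scores.values(), 2))
    close = sum(1 for a, b in pairs if abs(a - b) <= observed_difference)
    return close / len(pairs)

# Assumed case pair: questioned and reference samples differ by 0.10.
rate = different_source_match_rate(reference_scores, observed_difference=0.10)
print(f"{rate:.3f} of different-source pairs are at least this similar")
```

With only six locations the count rests on fifteen pairs, which shows why the size of the reference population matters so much to the denominator.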

Figure: Conceptual distributions of similarity scores for same-source and different-source pairs. A large LR results when the questioned-versus-reference similarity score falls in the region where same-source pairs are common and different-source pairs are rare.

Calibration and the Tippett plot

Calculating an LR and knowing whether it is reliable are distinct problems. Calibration is the process of checking that an LR calculation method produces values that actually correspond to the stated frequency difference between the two hypotheses. A method that reports LR = 10,000 for cases that are actually only 100 times more common under the same-source hypothesis is miscalibrated and misleading to a court.

The Tippett plot provides a visual calibration check. The analyst takes a large set of known same-source pairs and a large set of known different-source pairs, calculates the LR for each, and plots their cumulative distributions on the same axes. A well-calibrated system produces two curves that diverge sharply: the same-source distribution concentrates at high LR values, and the different-source distribution concentrates at LR values below 1. If the curves overlap substantially, the method is producing ambiguous LRs in the region of overlap, and that ambiguity should be reported as a limitation.
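The points behind a Tippett plot can be computed without any plotting library. The sketch below assumes one common convention, plotting the proportion of LR values at or above each threshold; the validation LRs are hypothetical. It also reports how often each set gives misleading evidence (same-source pairs with LR below 1, different-source pairs with LR above 1).

```python
def tippett_curve(lrs, thresholds):
    """Proportion of LR values greater than or equal to each threshold."""
    n = len(lrs)
    return [sum(1 for lr in lrs if lr >= t) / n for t in thresholds]

# Hypothetical validation results, invented for illustration only.
same_source_lrs = [0.5, 8, 40, 150, 900, 2500, 12000, 60000]
diff_source_lrs = [0.0002, 0.003, 0.01, 0.04, 0.2, 0.6, 3, 15]

thresholds = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
ss_curve = tippett_curve(same_source_lrs, thresholds)
ds_curve = tippett_curve(diff_source_lrs, thresholds)

for t, ss, ds in zip(thresholds, ss_curve, ds_curve):
    print(f"LR >= {t:>7}: same-source {ss:.3f}, different-source {ds:.3f}")

# Rates of misleading evidence in each validation set.
misleading_ss = sum(1 for lr in same_source_lrs if lr < 1) / len(same_source_lrs)
misleading_ds = sum(1 for lr in diff_source_lrs if lr > 1) / len(diff_source_lrs)
print(f"misleading same-source: {misleading_ss:.3f}")
print(f"misleading different-source: {misleading_ds:.3f}")
```

In a real validation the two sets would be far larger than eight pairs each, and the curves would be drawn on a log10 LR axis.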

The ENFSI verbal equivalence scale

The ENFSI verbal equivalence scale was developed to provide a common vocabulary for expert witnesses reporting LR-based evidence conclusions across forensic disciplines. The scale maps LR ranges to verbal descriptors, with LR values above 1 supporting the same-source hypothesis and values below 1 supporting the defence hypothesis. An LR of 1/x gives the same strength of support for the defence hypothesis as an LR of x gives for the prosecution hypothesis.

LR range	ENFSI verbal descriptor (same-source direction)
1 to 10	Limited support for the prosecution hypothesis
10 to 100	Moderate support
100 to 1,000	Moderately strong support
1,000 to 10,000	Strong support
10,000 to 1,000,000	Very strong support
Greater than 1,000,000	Extremely strong support

The scale is not mandatory, and individual jurisdictions have adopted variations. The UK Forensic Science Regulator's guidance broadly follows the ENFSI scale. The scale is symmetric: an LR of 0.001 (i.e., 1/1,000) would be reported as strong support for the defence hypothesis.
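The table and the symmetry rule can be expressed as a small lookup. This is a minimal sketch of the scale as given on this page, not an official ENFSI implementation. The function name is hypothetical, and two choices are assumptions: a value on a shared boundary (for example exactly 100) is assigned to the higher band, and an LR of exactly 1 is reported as supporting neither hypothesis.

```python
# Upper bounds of the lower four bands, taken from the table above.
ENFSI_BANDS = [
    (10, "limited support"),
    (100, "moderate support"),
    (1_000, "moderately strong support"),
    (10_000, "strong support"),
]

def enfsi_descriptor(lr):
    """Map a likelihood ratio to the verbal phrase used in the table."""
    if lr <= 0:
        raise ValueError("LR must be positive")
    if lr == 1:
        return "no support for either hypothesis"  # assumption: neutral value
    direction = "prosecution" if lr > 1 else "defence"
    magnitude = lr if lr > 1 else 1 / lr  # symmetry: 1/x mirrors x
    for upper, phrase in ENFSI_BANDS:
        if magnitude < upper:  # assumption: boundary goes to the higher band
            return f"{phrase} for the {direction} hypothesis"
    if magnitude <= 1_000_000:
        return f"very strong support for the {direction} hypothesis"
    return f"extremely strong support for the {direction} hypothesis"

for value in (12, 950, 570_000, 0.0005):
    print(value, "->", enfsi_descriptor(value))
```

Writing the boundary rule down explicitly matters because values close to a band edge are where practitioners are most likely to choose different phrases.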

The documented problem with verbal scales is inconsistency. Studies by Champod and Vuille, and by others in the DNA context, found that practitioner choice of verbal term for a given LR varies considerably, and that jurors interpret the same phrase differently depending on the strength of other evidence in the case. For geological evidence specifically, there is very limited empirical data on how jurors process verbal LR descriptions, and this is an open research gap.

Combining LRs from multiple independent tests

A full soil comparison typically generates multiple test results: colour (Munsell notation), particle-size distribution, mineralogical composition by polarising microscopy, and elemental geochemistry by ICP-MS. Each test gives an LR. The question is whether they can be combined.

The rule for combining LRs by multiplication holds only when the tests are statistically independent. Independence means that knowing the outcome of one test gives no information about the outcome of another, given either hypothesis. Colour reflects both the iron-oxide content and the organic-matter content of a soil; particle size reflects its depositional and weathering history. These are largely controlled by different processes and are often approximately independent. Two elemental concentrations measured in the same ICP-MS run are almost certainly correlated, driven by co-occurring minerals, and should not be treated as independent LRs.
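A first screen for dependence is the correlation between two measurements across the reference population. The sketch below computes a Pearson coefficient by hand; all values are hypothetical. A low linear correlation is consistent with independence but does not prove it, so this is a screening step rather than a justification on its own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical reference-population measurements (invented for illustration).
lead_mg_kg = [40, 85, 120, 60, 150, 95]
zinc_mg_kg = [55, 110, 160, 80, 210, 130]
colour_score = [1, 2, 3, 4, 5, 6]
particle_score = [3, 1, 4, 1, 5, 2]

r_elements = pearson_r(lead_mg_kg, zinc_mg_kg)
r_physical = pearson_r(colour_score, particle_score)

print(f"lead vs zinc: r = {r_elements:.3f}")          # same ICP-MS run
print(f"colour vs particle size: r = {r_physical:.3f}")
```

Here the two elements move together and would belong in a single multivariate LR, while the two physical scores show little linear association.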

Assess independence
Run a PCA or correlation matrix on the reference population data. Tests measuring correlated properties should be combined into a single multivariate LR rather than multiplied as separate values.
Calculate LR per independent test
Estimate each LR separately using the within-source and between-source variation for that test. Report the individual values before combining.
Multiply independent LRs
Where independence has been justified, multiply. A colour LR of 20, a particle-size LR of 15, and a mineralogy LR of 40 combine to 12,000, which falls in the 'very strong support' range (10,000 to 1,000,000) of the ENFSI scale.
Report with uncertainty
The combined LR inherits the uncertainty of each component estimate. Report the approximate value and note the key assumptions, particularly the independence assumption and the size of the reference population.
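The multiplication step can also be viewed on a log scale, where independent weights of evidence add. The sketch below uses the three illustrative LRs from the steps above and assumes independence has already been justified.

```python
from math import log10, prod

# Component LRs from the illustration above (independence assumed).
component_lrs = {"colour": 20, "particle size": 15, "mineralogy": 40}

combined_lr = prod(component_lrs.values())

# On a log10 scale each test contributes an additive weight of evidence.
log_weights = {name: log10(lr) for name, lr in component_lrs.items()}
total_log_weight = sum(log_weights.values())

for name, weight in log_weights.items():
    print(f"{name}: log10(LR) = {weight:.2f}")
print(f"combined LR = {combined_lr}, log10 = {total_log_weight:.2f}")
```

Reporting the individual log weights alongside the product makes it easy to see which test contributes most to the combined value.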

Figure: Combining independent LRs by multiplication. Each test (colour, particle size, and mineralogy) provides a separate weight of evidence; the combined LR is their product, valid only when independence is demonstrated.

Critique of verbal scale consistency

The practical value of the ENFSI verbal scale depends on two kinds of consistency: consistency among scientists reporting evidence, and consistency among jurors and judges interpreting it. Both have been challenged in the empirical literature.

On the scientist side, studies in DNA and fingerprint evidence have shown that practitioners given identical numerical LRs disagree on which verbal phrase to apply, particularly near scale boundaries. A practitioner who calculates LR = 950 may call it 'moderately strong support'; another may round up to 'strong support'. The effect on the court's assessment of the evidence depends on how sensitive jurors are to the verbal distinction.

On the juror side, the phrase 'strong support' has been shown in mock-jury studies to be interpreted as a much higher probability statement than the LR value warrants, especially when the surrounding case facts are already incriminating. Jurors can read 'strong support for the prosecution hypothesis' as close to certainty, when an LR of 5,000 combined with a prior that depends on all other case evidence might still leave meaningful doubt. This is not a problem with the LR itself; it is a communication problem, and it has not been solved.
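The gap between an LR and a probability of the hypothesis follows from the odds form of Bayes' theorem: posterior odds equal prior odds multiplied by the LR. The sketch below applies that rule to an LR of 5,000. The prior odds are hypothetical values chosen only to show the effect; they are not something the soil analyst would assign.

```python
def posterior_probability(prior_odds, lr):
    """Posterior probability of the same-source hypothesis from prior odds and LR."""
    posterior_odds = prior_odds * lr
    return posterior_odds / (1 + posterior_odds)

lr = 5_000
# Hypothetical prior odds reflecting the other evidence in a case.
for prior_odds in (1 / 100_000, 1 / 10_000, 1 / 100):
    p = posterior_probability(prior_odds, lr)
    print(f"prior odds {prior_odds:.5f} -> posterior probability {p:.3f}")
```

The same 'strong support' LR leaves the posterior probability anywhere from a few percent to near certainty, depending entirely on the prior.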

A case example: LR calculation for a geochemical soil match

The following numerical example uses a univariate lead concentration for clarity; the same logic extends to multivariate data via kernel density estimation or parametric distributions.

The reference sample (crime scene) has a lead concentration of 185 mg/kg. The questioned sample (suspect's boot) has a lead concentration of 192 mg/kg. Multiple samples from the crime-scene location (within-source replicates) have a mean of 183 mg/kg and a standard deviation of 12 mg/kg. The reference population (25 samples from surrounding areas) has a mean of 94 mg/kg and a standard deviation of 61 mg/kg.

Numerator: the probability density at 192 mg/kg under the within-source distribution N(183, 12), where 12 is the standard deviation. Using the normal probability density function, this evaluates to approximately 0.025.
Denominator: the probability density at 192 mg/kg under the reference population distribution N(94, 61). Using the same function, this evaluates to approximately 0.0018.
LR: 0.025 / 0.0018 = approximately 14. This falls in the 'moderate support' category on the ENFSI scale.
Interpretation: lead concentration alone provides only moderate support for the same-source hypothesis. Lead is elevated in both samples relative to the area background, which helps, but the within-source and between-source distributions overlap considerably. If mineralogy and particle size provided independent LRs of 30 and 25 respectively, the combined LR would be 14 x 30 x 25 = 10,500, which just crosses the 10,000 boundary into 'very strong support'.
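The arithmetic in this example can be checked by evaluating the two normal densities directly. The sketch below uses the means and standard deviations given above and Python's standard-library NormalDist. Modelling both distributions as normal is the example's simplifying assumption, and the variable names are illustrative.

```python
from statistics import NormalDist

# Parameters from the example (mg/kg); second argument is the standard deviation.
within_source = NormalDist(mu=183, sigma=12)
reference_population = NormalDist(mu=94, sigma=61)

questioned_value = 192  # lead concentration on the suspect's boot, mg/kg

numerator = within_source.pdf(questioned_value)
denominator = reference_population.pdf(questioned_value)
likelihood_ratio = numerator / denominator

print(f"numerator   = {numerator:.4f}")
print(f"denominator = {denominator:.4f}")
print(f"LR          = {likelihood_ratio:.1f}")
```

Both terms are densities per mg/kg rather than probabilities, so only their ratio carries meaning.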

Worked example

Full LR reporting for a burglary soil case

From sample analysis to court-ready verbal conclusion.

A residential burglary. Wet soil marks are found on a kitchen floor and sampled. The suspect's boots are seized two days later with adhering soil. The analyst performs colour matching (Munsell), particle-size distribution, and ICP-MS for fifteen elements.

Colour: the Munsell colour of the boot soil (10YR 4/3) matches the kitchen-floor sample closely. The analyst estimates LR(colour) = 25 based on the frequency of this Munsell combination in the 30-sample reference collection from the surrounding neighbourhood.
Particle size: both samples show a bimodal distribution with a silt peak at 20 microns and a fine-sand peak at 120 microns. This is unusual in the reference population. LR(PSD) = 60.
Geochemistry: the analyst runs PCA on the fifteen-element ICP-MS data for all 32 samples (questioned, reference, and 30 population). The questioned and reference samples cluster together and separate from the population. A multivariate LR (treating the fifteen elements as a single multivariate observation) = 380.
Independence check: colour and particle size are controlled by different soil properties and a correlation matrix shows r < 0.3 between the colour principal component and the particle-size bimodality score. Geochemistry is treated as a single multivariate test, not fifteen independent elements. Multiplication is justified.
Combined LR: 25 x 60 x 380 = 570,000. ENFSI verbal descriptor: 'very strong support for the hypothesis that both soil samples came from the same source'.

The report states the numerical LR, the ENFSI verbal phrase, the reference population size and sampling strategy, the independence justification, and the key assumption that the reference population represents the plausible alternative sources given the case facts. The analyst is available to explain each step under cross-examination.
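The reporting elements listed above can be held in one structure so that none is omitted. The sketch below is one possible layout, not a prescribed report format. The class and field names are hypothetical, and the values are those of the burglary example.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class SoilLRReport:
    """Hypothetical container for the elements of an LR report."""
    component_lrs: dict
    verbal_phrase: str
    reference_population_size: int
    independence_justification: str
    key_assumption: str

    @property
    def combined_lr(self):
        # Valid only if the component tests are independent.
        return prod(self.component_lrs.values())

    def summary(self):
        parts = ", ".join(f"{k} = {v}" for k, v in self.component_lrs.items())
        return (
            f"Component LRs: {parts}. Combined LR: {self.combined_lr:,}. "
            f"Verbal conclusion: {self.verbal_phrase}. "
            f"Reference population: {self.reference_population_size} samples. "
            f"Independence: {self.independence_justification} "
            f"Assumption: {self.key_assumption}"
        )

report = SoilLRReport(
    component_lrs={"colour": 25, "particle size": 60, "geochemistry": 380},
    verbal_phrase="very strong support for the same-source hypothesis",
    reference_population_size=30,
    independence_justification="Geochemistry is treated as one multivariate test.",
    key_assumption="The reference population represents plausible alternative sources.",
)
print(report.summary())
```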

Check your understanding


The probability of the observed data given the same-source hypothesis is 0.04. The probability given the different-source hypothesis is 0.002. What is the LR?

Key Takeaways

The LR numerator is estimated from within-source variation (how similar are replicates from the same location?); the denominator is estimated from between-source variation in the reference population.
Calibration, assessed using Tippett plots on known-source validation sets, confirms that reported LR values track actual frequency differences and are not systematically inflated or deflated.
The ENFSI verbal scale translates numerical LR ranges into standardised phrases; both the number and the phrase should be reported, with an explicit explanation of what each means.
Multiple LRs can only be multiplied when the tests are statistically independent; correlation structure should be assessed by PCA or correlation matrix before combination.
Verbal scale inconsistency between practitioners and misinterpretation by jurors are documented problems; the best mitigation is explicit explanation of the framework in the report and availability for cross-examination.
