How likelihood ratios are calculated and calibrated for soil and mineralogical comparisons, what the ENFSI verbal equivalence scale means in practice, and how multiple independent tests are combined.
You have run the analysis. You have a multivariate geochemical profile for the questioned sample and for the reference sample from the crime scene. They look similar. Now comes the harder question: how do you turn that similarity into a number a court can use, and how do you describe that number in words that a jury will understand without either overselling the result or underselling it?
The likelihood ratio is the analytical framework forensic science has converged on for answering that question, and the ENFSI verbal equivalence scale is the translating layer between numbers and plain language. Neither is perfect. The LR requires a well-designed reference population and honest calibration. The verbal scale introduces its own inconsistencies when different practitioners use different words for the same number, or when jurors read the same words differently depending on the context.
This topic works through the LR calculation for a soil comparison from first principles, covers calibration and what a Tippett plot shows, describes the ENFSI scale and its documented limitations, and then tackles the practical question of combining LRs from multiple independent tests. A worked numerical example anchors the concepts to a real reporting scenario.
Two hypotheses, two probabilities, one ratio.
The likelihood ratio is simply the probability of the observed evidence given the prosecution hypothesis divided by the probability of the observed evidence given the defence hypothesis. Written out, that is P(E | Hp) / P(E | Hd). Each term requires a different probability estimate, and those estimates come from different parts of the data.
For a soil geochemical comparison, P(E | Hp), the probability of seeing this degree of similarity given the two samples come from the same source, is estimated from within-source variation. The analyst collects replicate samples from the reference location (or uses the spread around a centroid in PCA space) to characterise how much variation is normal within a single source. If the questioned sample falls well within that spread, the numerator is high.
P(E | Hd), the probability of seeing this degree of similarity given the samples come from different sources, is estimated from between-source variation in the reference population. The analyst asks: among all the different-source pairs I can form from my reference collection, how often do two soils from different locations look as similar as the two in this case? If rare, the denominator is low, and the LR is large.
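The two terms can be sketched for a single measured variable. This is a minimal illustration, not a casework method: it assumes both the within-source and between-source variation are adequately described by normal distributions, and every number in it is hypothetical. For continuous measurements the two terms are probability densities rather than probabilities, but the ratio is interpreted in the same way. The function name `gaussian_lr` is invented for this example.

```python
from statistics import NormalDist


def gaussian_lr(x, within_mean, within_sd, between_mean, between_sd):
    """Likelihood ratio for one measurement x under two normal models.

    Assumption: within-source and between-source variation are both
    approximately normal. Real soil data often need a different model.
    """
    # Numerator: density of x given the same-source hypothesis (Hp),
    # modelled from the spread of replicates at the reference location.
    numerator = NormalDist(within_mean, within_sd).pdf(x)
    # Denominator: density of x given the different-source hypothesis (Hd),
    # modelled from the spread of the reference population.
    denominator = NormalDist(between_mean, between_sd).pdf(x)
    return numerator / denominator


# Hypothetical values for illustration only (arbitrary units).
lr_close = gaussian_lr(x=52.0, within_mean=50.0, within_sd=4.0,
                       between_mean=30.0, between_sd=15.0)
lr_far = gaussian_lr(x=30.0, within_mean=50.0, within_sd=4.0,
                     between_mean=30.0, between_sd=15.0)

print(f"Questioned value near the reference replicates: LR = {lr_close:.2f}")
print(f"Questioned value typical of the wider population: LR = {lr_far:.2e}")
```

A questioned value inside the within-source spread gives an LR above 1, while a value that is typical of the wider population but far from the reference replicates gives an LR well below 1.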
A number is only useful if it means what it says.
Calculating an LR is one thing. Knowing whether it is reliable is another. Calibration is the process of checking that the LR values a method reports match how much more probable the evidence really is under one hypothesis than under the other. A method that reports LR = 10,000 for evidence that is in fact only 100 times more probable under the same-source hypothesis than under the different-source hypothesis is badly miscalibrated, and would mislead a court.
The Tippett plot provides a visual calibration check. The analyst takes a large set of known same-source pairs and a large set of known different-source pairs, calculates the LR for each, and plots their cumulative distributions on the same axes. A well-calibrated system produces two curves that diverge sharply: the same-source distribution concentrates at high LR values, the different-source distribution concentrates at LR values below 1. If the curves overlap substantially, the method is producing ambiguous LRs in the region of overlap, and that ambiguity should be reported as a limitation.
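The data behind a Tippett plot can be sketched in a few lines. The example below assumes a set of validation LRs has already been calculated for known same-source and known different-source pairs; the values are hypothetical, and the helper name `tippett_curve` is invented here. It computes, for each threshold on a log10 scale, the proportion of pairs whose LR is at or above that threshold, and also the proportion of pairs whose LR points the wrong way (often called the rates of misleading evidence).

```python
import math


def tippett_curve(lrs, log10_thresholds):
    """Proportion of LRs at or above each log10 threshold."""
    logs = [math.log10(value) for value in lrs]
    return [sum(1 for v in logs if v >= t) / len(logs)
            for t in log10_thresholds]


# Hypothetical validation results, for illustration only.
same_source_lrs = [250.0, 40.0, 1200.0, 0.8, 15.0, 90.0]
diff_source_lrs = [0.01, 0.2, 0.05, 3.0, 0.002, 0.4]

log10_thresholds = [-3, -2, -1, 0, 1, 2, 3]
same_curve = tippett_curve(same_source_lrs, log10_thresholds)
diff_curve = tippett_curve(diff_source_lrs, log10_thresholds)

# LRs that point towards the wrong hypothesis.
misleading_same = sum(1 for v in same_source_lrs if v < 1) / len(same_source_lrs)
misleading_diff = sum(1 for v in diff_source_lrs if v > 1) / len(diff_source_lrs)

for t, s, d in zip(log10_thresholds, same_curve, diff_curve):
    print(f"log10(LR) >= {t:>2}: same-source {s:.2f}, different-source {d:.2f}")
print(f"Same-source pairs with LR < 1: {misleading_same:.2f}")
print(f"Different-source pairs with LR > 1: {misleading_diff:.2f}")
```

Plotting `same_curve` and `diff_curve` against the thresholds gives the two cumulative curves; the further apart they sit around log10(LR) = 0, the less often the method produces an LR that favours the wrong hypothesis.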
Translating a number into plain language without losing precision.
The ENFSI verbal equivalence scale was developed to provide a common vocabulary for expert witnesses reporting LR-based evidence conclusions across forensic disciplines. The scale maps LR ranges to verbal descriptors. LR values above 1 support the same-source (prosecution) hypothesis. LR values below 1 support the defence hypothesis, and the descriptor is chosen from the reciprocal, 1/LR.
| LR range | ENFSI verbal descriptor (same-source direction) |
|---|---|
| 1 to 10 | Limited support for the prosecution hypothesis |
| 10 to 100 | Moderate support |
| 100 to 1,000 | Moderately strong support |
| 1,000 to 10,000 | Strong support |
| 10,000 to 1,000,000 | Very strong support |
| Greater than 1,000,000 | Extremely strong support |
The scale is not mandatory, and individual jurisdictions have adopted variations. The UK Forensic Science Regulator's guidance broadly follows the ENFSI scale. The scale is symmetric: an LR of 0.001 (i.e., 1/1,000) would be reported as strong support for the defence hypothesis.
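The table lookup can be expressed as a small function. This is a sketch of the table as printed above, not an official ENFSI implementation. The table does not say which band a boundary value belongs to, so the code assumes each band includes its lower bound (which agrees with 1/1,000 being reported as strong support), and it assumes an LR of exactly 1 is reported as supporting neither hypothesis. The function name and the wording for LR = 1 are invented for this example.

```python
def verbal_descriptor(lr):
    """Map an LR to the verbal descriptor in the table above.

    Assumptions: lower band bounds are inclusive, and LR == 1 is
    reported as neutral. Neither detail is stated in the table.
    """
    if lr <= 0:
        raise ValueError("A likelihood ratio must be positive")
    if lr == 1:
        return "No support for either hypothesis"

    # The scale is symmetric: below 1, use the reciprocal and name
    # the defence hypothesis instead.
    favoured = "prosecution" if lr > 1 else "defence"
    magnitude = lr if lr > 1 else 1 / lr

    bands = [
        (10, "Limited"),
        (100, "Moderate"),
        (1_000, "Moderately strong"),
        (10_000, "Strong"),
    ]
    for upper, label in bands:
        if magnitude < upper:
            return f"{label} support for the {favoured} hypothesis"
    # The top band starts at "greater than 1,000,000", so exactly
    # 1,000,000 stays in the "very strong" band.
    if magnitude <= 1_000_000:
        return f"Very strong support for the {favoured} hypothesis"
    return f"Extremely strong support for the {favoured} hypothesis"


for value in (5, 950, 5_000, 0.001, 2_000_000):
    print(value, "->", verbal_descriptor(value))
```

Making the boundary convention explicit in code is one way for a laboratory to make sure every practitioner applies the same word to the same number.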
The documented problem with verbal scales is inconsistency. Studies by Champod and Vuille, and by others in the DNA context, found that practitioner choice of verbal term for a given LR varies considerably, and that jurors interpret the same phrase differently depending on the strength of other evidence in the case. For geological evidence specifically, there is very limited empirical data on how jurors process verbal LR descriptions, and this is an open research gap.
Four weak tests can together be powerful, provided they measure different things.
A full soil comparison typically generates multiple test results: colour (Munsell notation), particle-size distribution, mineralogical composition by polarising microscopy, and elemental geochemistry by ICP-MS. Each test gives an LR. The question is whether they can be combined.
The rule for combining LRs by multiplication holds only when the tests are statistically independent. Independence means that knowing the outcome of one test gives no information about the outcome of another, given either hypothesis. Colour reflects both the iron-oxide content and the organic-matter content of a soil; particle size reflects its depositional and weathering history. These are largely controlled by different processes and are often approximately independent. Two elemental concentrations measured in the same ICP-MS run are almost certainly correlated, driven by co-occurring minerals, and should not be treated as independent LRs.
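The multiplication rule can be sketched as follows. The four LR values are hypothetical and chosen only to show the arithmetic; the sketch assumes the four tests have already been judged approximately independent, and it treats the geochemistry as one LR from a single multivariate model rather than one LR per element.

```python
import math

# Hypothetical LRs for four tests assumed to be approximately independent.
test_lrs = {
    "colour": 8.0,
    "particle size": 6.0,
    "mineralogy": 12.0,
    "geochemistry (one multivariate LR)": 5.0,
}

# Under independence, the combined LR is the product of the individual LRs.
combined_lr = math.prod(test_lrs.values())

# Equivalently, the log10 LRs add, which is easier to read on a log scale.
combined_log10 = sum(math.log10(value) for value in test_lrs.values())

print(f"Combined LR: {combined_lr:.0f}")
print(f"Combined log10(LR): {combined_log10:.2f}")
```

Each of these illustrative tests would sit in the lowest or second-lowest band of the verbal scale on its own, yet the product falls in the 1,000 to 10,000 band. If the tests were correlated, the same product would overstate the strength of the evidence.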
The same word can mean different things to different people in different courtrooms.
The practical value of the ENFSI verbal scale depends on two kinds of consistency: consistency among scientists reporting evidence, and consistency among jurors and judges interpreting it. Both have been challenged in the empirical literature.
On the scientist side, studies in DNA and fingerprint evidence have shown that practitioners given identical numerical LRs disagree on which verbal phrase to apply, particularly near scale boundaries. A practitioner who calculates LR = 950 may call it 'moderately strong support'; another may round up to 'strong support'. The effect on the court's assessment of the evidence depends on how sensitive jurors are to the verbal distinction.
On the juror side, the phrase 'strong support' has been shown in mock-jury studies to be interpreted as a much higher probability statement than the LR value warrants, especially when the surrounding case facts are already incriminating. Jurors can read 'strong support for the prosecution hypothesis' as close to certainty, when an LR of 5,000 with a prior that depends on all other case evidence might still leave meaningful doubt. This is not a problem with the LR itself. It is a communication problem, and it has not been solved.
Following the numbers through a real calculation.
To make the framework concrete, consider a simplified numerical example based on a univariate lead concentration, though in practice the same logic extends to multivariate data via kernel density estimation or parametric distributions.
The reference sample (crime scene) has a lead concentration of 185 mg/kg. The questioned sample (suspect's boot) has a lead concentration of 192 mg/kg. Multiple samples from the crime-scene location (within-source replicates) have a mean of 183 mg/kg and a standard deviation of 12 mg/kg. The reference population (25 samples from surrounding areas) has a mean of 94 mg/kg and a standard deviation of 61 mg/kg.
The probability of the observed data given the same-source hypothesis is 0.04. The probability given the different-source hypothesis is 0.002. What is the LR?
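The arithmetic for this question is a single division. The sketch below uses only the two probabilities quoted above; the variable names are invented for the example.

```python
# Probabilities quoted in the worked example.
p_e_given_hp = 0.04   # P(E | Hp): same-source hypothesis
p_e_given_hd = 0.002  # P(E | Hd): different-source hypothesis

# The likelihood ratio is the first divided by the second.
lr = p_e_given_hp / p_e_given_hd

print(f"LR = {lr:.1f}")
```

The ratio is about 20: the observed data are roughly 20 times more probable if the two samples share a source than if they do not. In the verbal scale table above, a value of that size falls in the 10 to 100 band.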