Strength of Evidence and Likelihood Ratio Scales

Forensic scientists and courts need a shared vocabulary for translating a numerical likelihood ratio into language that judges and juries can understand. This topic examines the numerical scales used to describe evidential strength, from the Jeffreys scale to laboratory-adopted verbal equivalents, and discusses their mathematical basis, practical limitations, and the controversies they continue to generate.

Last updated: 24 Jun 2026

A likelihood ratio scale is a mapping from a numerical LR value to a verbal phrase that communicates evidential strength in plain language. When a forensic scientist calculates or estimates a likelihood ratio for a piece of evidence, the raw number conveys meaning to a statistician but is difficult to communicate directly to a judge or a jury. Scales such as the Jeffreys scale and the verbal equivalence tables adopted by forensic laboratories provide a standardised translation: an LR of 1000 might correspond to 'strong support for the prosecution hypothesis', while an LR of 10 might correspond to 'limited support'. The verbal phrase travels into the courtroom; the scale is the contract between the number and the phrase.

The idea of ordering evidence by strength is older than statistics itself, but rigorous numerical frameworks only emerged in the twentieth century. Harold Jeffreys proposed a logarithmic scale for Bayes factors in 1961, and forensic scientists later adapted it for likelihood ratios derived from trace evidence, DNA profiles, and other forensic comparisons. Since then, standards bodies including the European Network of Forensic Science Institutes (ENFSI), the Association of Forensic Science Providers (AFSP) in the UK, and equivalent bodies in Australia and North America have developed their own recommended scales, some converging on Jeffreys-inspired logarithmic boundaries and some departing from them.

The debate over which scale to use, and whether any verbal scale is appropriate at all, reflects genuine scientific disagreement. Critics argue that any fixed mapping between an LR range and a verbal label creates the illusion of precision where none exists, and that verbal phrases are inevitably misread by courts. Proponents argue that courts cannot function with raw probability ratios and that a well-defined scale with accompanying explanation is both honest and practical. Understanding both positions, and knowing the current guidance from major standards bodies, is essential for any scientist who writes evaluative reports.

By the end of this topic you will be able to:

Explain what a verbal equivalence scale does and why one is needed in evaluative reporting.
Describe the Jeffreys scale, its logarithmic structure, and the reasons forensic scientists initially adopted it.
Identify the scale levels recommended by the ENFSI evaluative reporting guideline and compare them with the AFSP scale.
Evaluate the main criticisms of verbal scales, including the transposition fallacy risk and the problem of arbitrary boundaries.
Apply the principle that a verbal phrase in a report must be accompanied by an explanation of the underlying LR framework and the scale used.

Key terms

Likelihood ratio (LR): The ratio of the probability of the evidence given one hypothesis to the probability of the evidence given an alternative hypothesis. In forensic evaluative reporting, the LR measures how much more (or less) probable the observed evidence is under the prosecution hypothesis than under the defence hypothesis.
Verbal equivalence scale: A table that assigns a verbal phrase to a range of LR values or log10(LR) values. The phrase is used in written forensic reports to communicate evidential strength to non-specialist readers. The table itself must be defined and justified by the issuing laboratory or standards body.
Jeffreys scale: A logarithmic scale for Bayes factors proposed by Harold Jeffreys in Theory of Probability (1961). Levels run from 'barely worth mentioning' at log10(BF) between 0 and 0.5, through 'substantial', 'strong', and 'very strong', to 'decisive' at log10(BF) greater than 2. Widely referenced in forensic science but rarely applied without modification.
ENFSI guideline: The European Network of Forensic Science Institutes Guideline for Evaluative Reporting in Forensic Science (2015, updated 2016). It defines the propositions framework, the LR framework, and a recommended seven-level verbal scale, and is one of the primary international standards for evaluative reporting.
Transposition fallacy: The error of treating the probability of the evidence given a hypothesis as though it were the probability of the hypothesis given the evidence. In the context of LR scales, the transposition fallacy arises when a jury reads 'strong support for the prosecution hypothesis' as meaning the defendant is probably guilty, rather than understanding it as a statement about the relative probability of the evidence.
Log10(LR): The base-10 logarithm of the likelihood ratio, sometimes called the weight of evidence (following I.J. Good). A log10(LR) of 1 corresponds to LR = 10, log10(LR) of 2 to LR = 100, and log10(LR) of 3 to LR = 1000. Logarithmic representation is used in verbal scales because LR values span many orders of magnitude.
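As a quick illustration of the logarithmic representation defined above, the following minimal sketch converts a few LR values to log10(LR). The function name log10_lr is an assumption made for this example and does not come from any guideline.

```python
import math

def log10_lr(lr: float) -> float:
    """Return log10 of a likelihood ratio; the LR must be positive."""
    if lr <= 0:
        raise ValueError("a likelihood ratio must be greater than zero")
    return math.log10(lr)

# Values above 1 give a positive log10(LR); values below 1 give a negative one.
for lr in (0.01, 1, 10, 100, 1000, 4.7e12):
    print(f"LR = {lr:g}  ->  log10(LR) = {log10_lr(lr):+.2f}")
```

An LR below 1 produces a negative log10(LR), which corresponds to evidence that is less probable under the first hypothesis than under the alternative.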

Why verbal scales exist: the communication problem

A forensic scientist who reports 'the likelihood ratio for this DNA profile is 4.7 x 10^12' has communicated accurately but not usefully to most triers of fact. The number is too large to be intuitive and too abstract to interpret without a reference point. Courts in England and Wales, Australia, the Netherlands, and elsewhere have consistently found that raw LR figures, without contextualisation, are not well understood by jurors or even by some judges.

Verbal equivalence scales address this by providing a bridge between the statistical output and natural language. The scientist calculates or estimates the LR, locates it on the laboratory's agreed scale, and reports the corresponding phrase, such as 'very strong support for the proposition that the DNA originated from the suspect'. The phrase is more accessible than the number, and when accompanied by an explanation of what 'support' means in the LR framework, it allows a court to understand both the direction and the rough magnitude of the evidence.

The price of accessibility is potential misinterpretation. A phrase that says 'strong support for the prosecution hypothesis' can be misread as a statement about the defendant's guilt rather than about the relative probability of the evidence under two competing hypotheses. This risk is not hypothetical. Documented cases in the UK, Australia, and elsewhere include instances where evaluative language was misunderstood at trial, contributing to the broader reform of forensic reporting standards that has taken place since the 2000s.

The Jeffreys scale: origin and structure

Harold Jeffreys was a British mathematician and geophysicist whose Theory of Probability (third edition, 1961) set out a Bayesian framework for scientific inference. As part of that framework, he proposed a scale for interpreting Bayes factors, that is, the ratio of the probability of the data under one hypothesis to the probability of the data under another. His scale used log10 of the Bayes factor as the primary axis and assigned verbal labels to successive intervals.

log10(LR)	LR range (approx.)	Jeffreys label	Forensic adaptation label (common)
0 to 0.5	1 to 3.2	Barely worth mentioning	Limited / weak support
0.5 to 1	3.2 to 10	Substantial	Moderate support
1 to 1.5	10 to 32	Strong	Moderate to strong support
1.5 to 2	32 to 100	Very strong	Strong support
> 2	> 100	Decisive	Very strong / extremely strong support

Jeffreys did not design his scale for forensic use. He was concerned with hypothesis testing in scientific publication, not with courtroom communication. Forensic scientists adopted and adapted the scale because it was already well-known in Bayesian statistics and its logarithmic structure was convenient: LR values in forensic science commonly span from 1 to 10^15 or beyond, and a log scale compresses this range into a manageable set of intervals.
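The Jeffreys table above can be expressed as a simple lookup. This is a minimal sketch, not a reference implementation: the names jeffreys_label, JEFFREYS_BOUNDS, and JEFFREYS_LABELS are invented for the example, and assigning a value that falls exactly on a cut-point to the lower level is an assumption (the table only settles this for the top level, which starts above 2).

```python
import math
from bisect import bisect_left

# log10 cut-points and labels taken from the Jeffreys table in this topic.
JEFFREYS_BOUNDS = [0.5, 1.0, 1.5, 2.0]
JEFFREYS_LABELS = [
    "barely worth mentioning",
    "substantial",
    "strong",
    "very strong",
    "decisive",
]

def jeffreys_label(lr: float) -> str:
    """Return the Jeffreys label for an LR greater than 1."""
    if lr <= 1:
        raise ValueError("this sketch only handles LR > 1")
    # ASSUMPTION: bisect_left places a value equal to a cut-point in the lower level.
    return JEFFREYS_LABELS[bisect_left(JEFFREYS_BOUNDS, math.log10(lr))]

for lr in (2, 5, 20, 50, 500):
    print(f"LR = {lr}: {jeffreys_label(lr)}")
```

Keeping the cut-points and labels in two parallel lists makes it straightforward to substitute a laboratory's own scale without changing the lookup logic.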

The adaptations introduced by forensic scientists have varied. Some laboratories apply the original Jeffreys boundaries without change. Others shift the boundaries upward on the basis that forensic decisions carry higher stakes than scientific publication and that a higher LR should be required before using a phrase like 'strong'. Still others add levels at the high end of the scale to distinguish between an LR of 10^6 and an LR of 10^12, which are both 'decisive' on the original Jeffreys scale but represent very different magnitudes of evidence.

ENFSI and AFSP scales: current standards

The ENFSI Guideline for Evaluative Reporting in Forensic Science, first published in 2015, provides the most widely referenced European standard for LR-based reporting. Its verbal scale runs from 'limited support' at LR values just above 1 to 'overwhelmingly strong support' at the highest LR values. The guideline explicitly states that the verbal terms are directional, toward either the prosecution or defence proposition, and that they describe the strength of the evidence, not the probability of guilt.

The ENFSI guideline's critical contribution is not the specific words chosen but the framework requirements surrounding them. A laboratory using the ENFSI scale must: define the propositions at an appropriate level in the hierarchy of propositions (source, activity, or offence level), state the conditioning assumptions, specify which scale is being applied and why, and ensure that the evaluative statement is accompanied by text explaining what the LR framework means. The scale without the framework is incomplete.

The Association of Forensic Science Providers in the UK published a scale in 2009 that has been influential in British laboratory practice. It uses six levels of support with boundaries at log10(LR) = 1, 2, 3, 4, and 6, meaning that an LR between 10^3 and 10^4 is 'strong support' and an LR between 10^4 and 10^6 is 'very strong support'. Other countries have developed their own variants. The Netherlands Forensic Institute has published its own scale; the Australian and New Zealand Forensic Science Society has produced guidelines that reflect the ENFSI framework but are adapted to Australian court requirements.

Criticisms of verbal scales

The most fundamental criticism of verbal scales is that the boundary values between levels are arbitrary. There is no mathematical reason why log10(LR) = 2 should mark the transition from one verbal category to the next. The boundaries are consensus choices made by expert committees, and different committees have made different choices. Critics argue that this arbitrariness is hidden when the verbal phrase is presented in a report without the underlying LR value, because the reader cannot know how close the actual LR was to the boundary.
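The boundary problem can be seen numerically. The sketch below uses the Jeffreys cut-points from this topic purely as a convenient example; the function name label_and_margin and the choice to place a value exactly on a cut-point in the lower level are assumptions made for the illustration.

```python
import math
from bisect import bisect_left

BOUNDS = [0.5, 1.0, 1.5, 2.0]  # Jeffreys log10 cut-points from this topic
LABELS = ["barely worth mentioning", "substantial", "strong",
          "very strong", "decisive"]

def label_and_margin(lr: float):
    """Return (label, distance in log10 units to the nearest cut-point) for LR > 1."""
    if lr <= 1:
        raise ValueError("this sketch only handles LR > 1")
    x = math.log10(lr)
    label = LABELS[bisect_left(BOUNDS, x)]
    margin = min(abs(x - b) for b in BOUNDS)
    return label, margin

for lr in (99, 101, 1e12):
    label, margin = label_and_margin(lr)
    print(f"LR = {lr:g}: {label!r}, {margin:.3f} log10 units from the nearest cut-point")
```

An LR of 99 and an LR of 101 receive different labels although they are almost identical, while an LR of 101 and an LR of 10^12 share a label. A reader who sees only the phrase cannot tell which situation applies.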

A second criticism concerns the transposition fallacy risk. Experimental studies, including jury simulation experiments conducted in the UK and Australia, have found that verbal phrases such as 'strong support for the prosecution hypothesis' are frequently misread as statements about the probability of guilt. Even legally trained readers show this pattern. The risk is not eliminated by including an explanation of the LR framework in the report, because jurors under time pressure or cognitive load may attend to the verbal phrase and ignore the explanation.
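The difference between a statement about the evidence and a statement about the hypothesis can be made concrete with the odds form of Bayes' theorem (posterior odds = prior odds x LR). The sketch below is illustrative only: the function name and the prior probabilities are assumptions chosen for the example, and assigning a prior is a matter for the trier of fact, not the scientist.

```python
def posterior_probability(prior_probability: float, lr: float) -> float:
    """Update a prior probability with an LR using the odds form of Bayes' theorem."""
    if not 0.0 < prior_probability < 1.0:
        raise ValueError("prior probability must lie strictly between 0 and 1")
    if lr <= 0:
        raise ValueError("a likelihood ratio must be greater than zero")
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1.0 + posterior_odds)

LR = 1000  # the same evidential strength in every row

# ASSUMPTION: these priors are arbitrary illustrations, not recommendations.
for prior in (0.5, 0.01, 0.000001):
    print(f"prior = {prior:g}  ->  posterior = {posterior_probability(prior, LR):.4f}")
```

The same LR leaves the hypothesis very probable under one prior and still very improbable under another, which is why a phrase describing the LR cannot be read as a probability of guilt.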

A third criticism is that verbal scales can obscure large differences in evidential strength within a single category. Under most scales, an LR of 10^6 and an LR of 10^10 might both receive the label 'extremely strong support', but in a Bayesian sense they shift the prior odds by very different amounts. This compression at the high end is a structural property of any scale with finite levels covering a range that spans fifteen or more orders of magnitude.
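To see how much information a shared label can hide, the following sketch applies two LRs that might receive the same phrase to one fixed set of prior odds. The prior odds value is an arbitrary assumption chosen for the illustration, and the function names are invented for this example.

```python
def posterior_odds(prior_odds: float, lr: float) -> float:
    """Odds form of Bayes' theorem: posterior odds = prior odds x LR."""
    return prior_odds * lr

def odds_to_probability(odds: float) -> float:
    """Convert odds in favour of a hypothesis to a probability."""
    return odds / (1.0 + odds)

PRIOR_ODDS = 1e-7  # ASSUMPTION: illustrative value only, not taken from any case

for lr in (1e6, 1e10):
    odds = posterior_odds(PRIOR_ODDS, lr)
    print(f"LR = {lr:.0e}: posterior odds = {odds:g}, "
          f"probability = {odds_to_probability(odds):.4f}")
```

With these assumed prior odds, the smaller LR leaves the hypothesis improbable and the larger one makes it highly probable, even though a scale with a single top category would describe both in the same words.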

Proponents of verbal scales respond that the alternative, reporting only the raw LR, creates its own problems. Courts have been shown to overweight large numbers when they appear in evidence, treating an LR of 10^15 as qualitatively different from an LR of 10^10 even when the difference is operationally irrelevant. A verbal scale, when properly explained, provides a controlled simplification that is more honest about the limits of the underlying model than a precise-looking raw figure derived from population databases with their own assumptions and uncertainties.

LR scales across disciplines and jurisdictions

Verbal equivalence scales originated in DNA evidence and have since spread to other forensic disciplines including fingerprint comparison, handwriting analysis, toolmark examination, and voice comparison. The application raises specific problems in each discipline. For DNA, the LR can often be calculated from population databases with known statistical properties. For fingermarks or handwriting, the LR must frequently be estimated from examiner experience or from small reference databases, and the uncertainty around the estimate is larger. Using the same verbal scale across disciplines with different underlying uncertainty levels can be misleading.

Jurisdictional variation is substantial. Courts in England and Wales operate under the guidance of the Forensic Science Regulator, whose Codes of Practice require evaluative reporting to follow the LR framework and to use verbal scales consistent with the AFSP or equivalent standards. In India, evaluative reporting using the Bayesian LR framework is not yet standard in most laboratory practice, though academic and regulatory discussions are active following the Bharatiya Sakshya Adhiniyam 2023, which continues to treat expert opinion as a category of evidence without specifying a reporting framework. In the United States, the President's Council of Advisors on Science and Technology (PCAST) 2016 report on forensic science recommended that probabilistic reporting frameworks be adopted more widely, but implementation remains uneven across federal and state jurisdictions. In the European Union, the ENFSI guideline is the primary reference across member states, though national accreditation requirements vary.

The result is that a forensic scientist who works in international cases may need to adapt their reporting to multiple sets of requirements. A report produced for a Dutch court under the Netherlands Forensic Institute's scale may need to be reformatted for use in an Australian court under different guidance. The propositions framework, the LR logic, and the direction of the verbal phrase are transferable; the specific verbal labels and boundary values are not.

Best practice for evaluative reporting with scales

The current consensus among forensic standards bodies is that a verbal scale must always be accompanied by the following elements: a statement of the prosecution and defence propositions at the appropriate level, a statement of the conditioning assumptions, identification of the scale being used, and an explanation of what the verbal phrase means in terms of the LR framework. These requirements appear in the ENFSI guideline, the UK Forensic Science Regulator's Codes of Practice, and equivalent documents from forensic science bodies in Australia and North America.

In practice, many laboratories include a standard explanatory paragraph in all evaluative reports. The paragraph explains that the scientist has calculated or estimated an LR, that the LR measures the probability of the evidence under two competing hypotheses, that the verbal phrase is derived from an agreed scale, and that the verbal phrase does not represent a probability of guilt. This boilerplate approach has been criticised as a procedural gesture rather than genuine communication, but it does ensure that the explanatory text is present in the record even if it is not read carefully.

Some scientists and courts have moved toward reporting the numerical LR alongside the verbal phrase, on the grounds that this provides the trier of fact with more information and makes the mapping transparent. The ENFSI guideline does not prohibit reporting the numerical LR but does not require it either. The UK Forensic Science Regulator's guidance is that numerical values may be reported where they can be derived and explained, but that the verbal phrase should always accompany them. The emerging consensus is that transparency about the number, the scale, and the assumptions is better than concealing any of them.
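One way to keep the required elements together is to treat them as a single record, so that the verbal phrase cannot be rendered without its propositions, scale, and explanation. The following is a hypothetical sketch: the class name, field names, example values, and the wording of the rendered text are all assumptions made for illustration and are not taken from the ENFSI guideline, the Forensic Science Regulator's Codes of Practice, or any laboratory template.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluativeStatement:
    """Elements that accompany a verbal phrase in a report (illustrative only)."""
    proposition_1: str
    proposition_2: str
    verbal_phrase: str
    scale_name: str
    lr: float
    assumptions: list = field(default_factory=list)

    def render(self) -> str:
        lines = [
            f"Propositions: (1) {self.proposition_1}; (2) {self.proposition_2}.",
            f"The findings provide {self.verbal_phrase} for proposition 1 rather "
            f"than proposition 2 (likelihood ratio approximately {self.lr:g}).",
            f"Scale applied: {self.scale_name}.",
            "The phrase describes the relative probability of the findings under "
            "the two propositions. It is not a probability that either "
            "proposition is true.",
        ]
        lines += [f"Assumption: {a}" for a in self.assumptions]
        return "\n".join(lines)

# HYPOTHETICAL example values, invented for this sketch.
statement = EvaluativeStatement(
    proposition_1="the questioned item came from source A",
    proposition_2="the questioned item came from another source",
    verbal_phrase="moderately strong support",
    scale_name="hypothetical laboratory scale (definition attached as appendix)",
    lr=250,
    assumptions=["the reference database is representative of the relevant population"],
)
print(statement.render())
```

Rendering the number, the scale, and the assumptions in one place reflects the transparency argument made above; the exact wording would be set by the laboratory and its applicable standards.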

Worked example

Applying a verbal scale to a glass fragment LR

A forensic scientist has compared glass fragments recovered from a suspect's jacket with glass from a broken shop window. The analysis produces a likelihood ratio. This example traces how that LR is converted to a verbal phrase, and what the scientist must include in the report.

The scientist's propositions are: H1 (prosecution): the glass on the jacket came from the broken window; H2 (defence): the glass on the jacket came from another source. The scientist estimates the LR using refractive index measurements and a relevant database of background glass distributions. The resulting LR is 850, meaning the evidence is 850 times more probable under H1 than under H2.

Locate the LR on the scale. log10(850) is approximately 2.93. Under the AFSP scale, an LR between 100 and 1000 (log10(LR) between 2 and 3) corresponds to 'moderately strong support'. Under the Jeffreys scale, any log10(LR) above 2 is 'decisive'.
Choose the scale. The laboratory uses the AFSP scale. The scientist records this in the report and attaches the scale definition as an appendix or cites the published reference.
Draft the evaluative statement. 'I find moderately strong support for the proposition that the glass fragments on the jacket originated from the broken window, rather than the proposition that they originated from another source.'
Add the framework explanation. The report includes a standard paragraph explaining that the verbal phrase reflects an LR of approximately 850, that this LR measures the relative probability of the evidence under the two stated propositions, and that it does not represent a probability of guilt or innocence.
Note the limiting assumptions. The scientist states that the LR is conditioned on the assumption that the background glass population in the relevant database is representative of the area where the offence occurred, and notes that if this assumption is incorrect the LR may need to be revised.
Cross-check against alternative scales. Had the scientist used the Jeffreys scale, the same LR would be described as 'decisive'. Had they used a more conservative laboratory scale with a boundary at log10(LR) = 4 for 'strong support', the same LR would be described only as 'moderate support'. This comparison illustrates why the scale must be identified in the report.
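The cross-check in the final step can be sketched as a lookup of the same LR on more than one scale. This is a minimal illustration under stated assumptions: label_on and SCALES are names invented here, the Jeffreys entries follow the table earlier in this topic, and the second scale is hypothetical, constructed only to mirror the 'more conservative laboratory scale' mentioned above.

```python
import math
from bisect import bisect_left

def label_on(bounds, labels, lr):
    """Return the label for lr (> 1) on a scale given as log10 cut-points and labels."""
    if lr <= 1:
        raise ValueError("this sketch only handles LR > 1")
    if len(labels) != len(bounds) + 1:
        raise ValueError("need exactly one more label than cut-points")
    return labels[bisect_left(bounds, math.log10(lr))]

SCALES = {
    # Jeffreys cut-points and labels as tabulated earlier in this topic.
    "Jeffreys": (
        [0.5, 1.0, 1.5, 2.0],
        ["barely worth mentioning", "substantial", "strong",
         "very strong", "decisive"],
    ),
    # HYPOTHETICAL scale invented for this sketch: 'strong support' starts
    # only above log10(LR) = 4.
    "hypothetical conservative scale": (
        [1.0, 4.0],
        ["limited support", "moderate support", "strong support or above"],
    ),
}

lr = 850
print(f"log10({lr}) = {math.log10(lr):.2f}")
for name, (bounds, labels) in SCALES.items():
    print(f"{name}: {label_on(bounds, labels, lr)}")
```

The same number yields very different phrases depending on the scale supplied, which is the reason the report must identify the scale.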

Check your understanding


A forensic scientist calculates an LR of 500 for a fibre comparison. Using the Jeffreys scale, log10(500) is approximately 2.7. Which Jeffreys label applies?

Key Takeaways

A verbal equivalence scale maps a range of LR values to a phrase such as 'strong support', allowing forensic scientists to communicate evidential strength in language accessible to courts without abandoning the underlying probabilistic framework.
The Jeffreys scale, based on log10 of the Bayes factor, has five levels running from 'barely worth mentioning' to 'decisive', and is the historical reference point for most forensic scales, though no laboratory should apply it without considering whether its boundaries are appropriate for their discipline and jurisdiction.
The ENFSI guideline and the AFSP scale are the primary reference standards for evaluative reporting in Europe and the UK respectively. Neither mandates fixed numerical boundaries; laboratories must define and justify their own cut-points.
The main criticisms of verbal scales are that boundary values are arbitrary, that they risk the transposition fallacy, and that they compress very different LR values into a single label. These are genuine limitations, not merely theoretical concerns.
Best practice requires that any verbal phrase in a report be accompanied by the propositions, the conditioning assumptions, identification of the scale used, and an explanation of what the LR framework means so that the trier of fact can interpret the phrase correctly.

What is a likelihood ratio scale in forensic science?

A likelihood ratio scale maps numerical LR values to verbal phrases such as 'moderate support' or 'very strong support'. The scale lets a forensic scientist communicate the strength of evidence in plain language while remaining anchored to a calculated or estimated probability ratio. Different laboratories and standards bodies use different scale boundaries, so the verbal phrase alone cannot be interpreted without knowing which scale was applied.

What is the Jeffreys scale and is it still used in forensic science?

Harold Jeffreys proposed a logarithmic scale for interpreting Bayes factors in his 1961 book Theory of Probability. The scale assigns verbal labels to ranges of log10(LR), running from 'barely worth mentioning' at LR values just above 1 up to 'decisive' at log10(LR) greater than 2. Forensic scientists borrowed the Jeffreys scale as a convenient reference, but most forensic standards bodies now recommend laboratory-specific scales with defined boundaries that have been agreed upon and validated, rather than applying Jeffreys directly.

What verbal equivalent scale does ENFSI recommend?

The European Network of Forensic Science Institutes (ENFSI) guideline on evaluative reporting recommends a seven-level scale running from 'limited support' at LR values just above 1 through 'moderate support', 'moderately strong support', 'strong support', 'very strong support', and 'extremely strong support', to 'overwhelmingly strong support' at very high LR values. The guideline explicitly states that the boundaries between levels are not fixed and that laboratories must define and justify their own cut-points.

Why do different laboratories use different LR scale boundaries?

There is no mathematically derived reason why, for example, log10(LR) = 2 should mark a boundary between 'moderate' and 'strong' support. The boundaries are expert consensus choices, and different expert communities have converged on slightly different positions. Laboratories are expected to define their own scales based on their discipline, their validation data, and any applicable national or accreditation standards. The Association of Forensic Science Providers in the UK and the ENFSI guideline both acknowledge this variability.

Can a verbal equivalence scale mislead a jury?

Yes, and this is a documented concern in the scientific literature. A phrase such as 'strong support for the prosecution hypothesis' can be misread as a statement about guilt rather than a statement about the relative probability of the evidence under two competing hypotheses. Courts in England and Wales, Australia, and elsewhere have encountered difficulties when evaluative statements were not explained in terms of the underlying LR framework. Several national forensic science bodies now require that any verbal scale be accompanied by an explanation of what it means, and some courts have required scientists to provide the underlying numerical basis alongside the verbal phrase.
