
Variability and Measurement Error

Forensic measurements are never perfectly reproducible, and the spread of repeated readings must be characterised before any comparison can be interpreted. This topic distinguishes within-class and between-class variability from instrument error, and introduces precision, accuracy, and repeatability as distinct, independently quantifiable properties.

Every physical measurement in forensic science contains two separate sources of spread: the natural variability of the material or population being measured, and the error introduced by the instrument, the analyst, and the procedure. Before any forensic comparison can be interpreted, both contributions must be characterised and kept distinct. The spread among items from the same source is called within-class variability. The spread between items from different sources is called between-class variability. Instrument error is the discrepancy between a recorded value and the true value of the property, and it is an artefact of the measurement process, not of the material. Evidence is most informative when between-class variability is large relative to the combined total of within-class variability and measurement error, because only then can the analyst reliably distinguish items of different origins from items of the same origin.

Precision, accuracy, and repeatability are three distinct properties of a measurement system, and each must be evaluated separately. Precision describes how closely repeated measurements of the same specimen agree with each other. Accuracy describes how closely a measurement agrees with the accepted true or reference value. Repeatability is the agreement between successive measurements made under identical conditions by the same analyst on the same day. A method can be precise without being accurate, accurate without being precise, or both. Forensic laboratories are required by accreditation standards, such as ISO/IEC 17025, to document all three properties for each quantitative method before results from that method are reported in casework.

The practical importance of this framework is clearest when a comparison falls in a grey zone: two measurements that are close but not identical. Without knowing the magnitude of measurement error for the method and the natural spread of within-class variation for the population in question, the analyst cannot determine whether the difference is consistent with a common origin, inconsistent with a common origin, or simply uninformative. This question has been central to disputes over glass evidence, soil comparison, fibre analysis, and many other trace disciplines in courts across the United States, the United Kingdom, India, and the European Union. The concepts introduced in this topic underpin all of those debates.

By the end of this topic you will be able to:

  • Define within-class and between-class variability and explain how their ratio determines the discriminating power of a forensic measurement.
  • Distinguish measurement error from natural variability and identify the two as separate, independently quantifiable contributions to observed spread.
  • Explain the difference between precision and accuracy, and describe scenarios in which a method can have one property without the other.
  • Define repeatability and reproducibility, explain how they are measured, and interpret typical within-laboratory validation data.
  • Apply knowledge of measurement error to judge whether a numerical difference between two forensic specimens is meaningful, uninformative, or below the method detection threshold.
Key terms
Within-class variability
The natural spread of measurements among items that share a common source, class, or category. For example, the range of refractive index values measured across different fragments from the same broken glass pane. Also called intra-class variability.
Between-class variability
The spread of measurements between items from different sources or categories. A forensic method is most discriminating when between-class variability is large relative to within-class variability, so that same-source items cluster together and different-source items separate clearly.
Measurement error
The discrepancy between a recorded value and the true value of the property being measured. Arises from instrument calibration limits, analyst technique, environmental conditions, and procedure-specific factors. Distinct from natural variability in the material itself.
Precision
The degree to which repeated measurements of the same specimen under the same conditions agree with each other. Quantified by the standard deviation or coefficient of variation of replicate measurements. A precise method has low scatter regardless of whether it is centred on the true value.
Accuracy
The degree to which a measured value agrees with the accepted true or reference value of the quantity. A biased instrument can produce precise but inaccurate results. Accuracy is assessed using certified reference materials or standards with known values.
Repeatability
The agreement between successive measurements of the same specimen made by the same analyst, on the same instrument, in the same laboratory, within a short time. Sets the floor of variability for a method. Reproducibility extends this to changed conditions such as different analysts or laboratories.

Within-class and between-class variability

Consider a forensic scientist comparing a glass fragment recovered from a suspect's clothing with glass fragments from a broken window at a scene. Both sets of fragments are measured for refractive index. The question is whether the two sets could share a common origin. To answer it, the scientist needs two reference distributions: how much do refractive index values vary among fragments from the same pane, and how much do they vary between panes of different origin?

The first distribution is the within-class variability. Even fragments broken from the same pane will not all have exactly the same refractive index, because glass composition is not perfectly uniform and because measurement itself introduces spread. The second distribution is the between-class variability. Different panes, made at different times or factories, will tend to have different mean refractive index values. A forensic method is informative only when the between-class distribution is wider than the within-class distribution. If all panes had essentially the same refractive index, measuring it would discriminate nothing.
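The comparison of the two distributions can be sketched in a few lines of Python. This is a minimal illustration, not a validated procedure: the function name `variance_components` and the refractive index values are hypothetical, and the variance of the group means is used as a crude stand-in for between-class variability.

```python
from statistics import mean, variance

def variance_components(groups):
    """Return (within_var, between_var) for a list of measurement groups.

    within_var:  pooled sample variance inside the groups (same source).
    between_var: sample variance of the group means (different sources).
    """
    # Pool the within-group variances, weighting each by its degrees of freedom.
    ss_within = sum(variance(g) * (len(g) - 1) for g in groups)
    df_within = sum(len(g) - 1 for g in groups)
    within_var = ss_within / df_within
    between_var = variance([mean(g) for g in groups])
    return within_var, between_var

# Hypothetical refractive index readings for fragments from three panes.
panes = [
    [1.5180, 1.5181, 1.5182],
    [1.5195, 1.5196, 1.5197],
    [1.5210, 1.5211, 1.5212],
]
within_var, between_var = variance_components(panes)
print(f"between/within variance ratio: {between_var / within_var:.1f}")
```

A large ratio suggests that the measured property separates sources well; a ratio near or below one suggests that it discriminates little.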

The same framework applies across forensic disciplines. In soil comparison, within-class variability is the natural heterogeneity within a small area of a single location; between-class variability is the difference between locations. In fibre analysis, within-class variability covers variation in dye absorption among fibres from the same garment; between-class variability covers the distribution of fibre types across different garments. In DNA profiling, within-class variability is the measurement noise around allele peaks; between-class variability is the population-level frequency of each allele. The mathematical structure is the same in every case.

Discipline | Within-class variability | Between-class variability
Glass (refractive index) | Spread across fragments from one pane | Spread across panes from different sources
Soil comparison | Natural heterogeneity within one location | Differences between separate locations
Fibre analysis | Dye absorption variation within one garment | Colour and fibre-type range across garments
DNA profiling | Measurement noise around allele peaks | Allele frequency distribution across the population
Handwriting metrics | Letter-size variation within one writer's samples | Metric differences between different writers

Measurement error: sources and separation

Measurement error is not the same as variability in the material being studied. It is a property of the measurement system: the instrument, the analyst, and the procedure. When a refractive index measurement is replicated ten times on the same fragment without moving it from the instrument stage, the spread of those ten values reflects only measurement error. The fragment has not changed; any variation in the numbers comes from instrument noise, temperature fluctuation, focus variability, and other procedural factors.

The sources of measurement error fall into three broad categories. Random error is unpredictable fluctuation that causes individual readings to scatter above and below the true value. Averaging many replicates reduces the influence of random error on the reported result. Systematic error, or bias, shifts all readings in the same direction relative to the true value. It cannot be reduced by replication; it requires recalibration or correction. Method-specific error arises from the measurement protocol itself, for example incomplete digestion before elemental analysis, or air bubbles trapped in a mounting medium for refractive index work.
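The different behaviour of random and systematic error under replication can be illustrated with a short sketch. It assumes independent random errors, so that the random part of the error in a mean shrinks with the square root of the number of replicates; the function name `error_in_mean` and the instrument figures are hypothetical.

```python
import math

def error_in_mean(bias, random_sd, n_replicates):
    """Split the expected error of an n-replicate mean into two parts.

    Assumes independent random errors with standard deviation random_sd.
    """
    systematic_part = bias                             # unchanged by replication
    random_part = random_sd / math.sqrt(n_replicates)  # standard error of the mean
    return systematic_part, random_part

# Hypothetical instrument: bias of 0.3 units, random SD of 0.2 units.
for n in (1, 4, 16, 100):
    systematic, random_part = error_in_mean(0.3, 0.2, n)
    print(f"n={n:>3}  bias={systematic:.3f}  random={random_part:.3f}")
```

However many replicates are averaged, the bias column stays the same, which is why systematic error has to be handled by recalibration or correction.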

Once both components are estimated, and provided they are independent of each other, the total observed variance in a set of casework measurements can be written as the sum of the variance due to true material variation and the variance due to measurement error. If measurement error accounts for most of the observed spread, improving the instrument or protocol will sharpen the discrimination. If true material variation dominates, better instrumentation will not help; the discrimination limit is set by the biology or chemistry of the material itself.
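One way to sketch this decomposition is shown below. It is an illustration under stated assumptions: the casework values are single readings on different items of the same class (so their variance contains both components), the replicate values are repeated readings on one item (so their variance contains only measurement error), and the two components are independent. The function name `split_variance` and the data are hypothetical.

```python
from statistics import variance

def split_variance(casework_values, replicate_values):
    """Return (total_var, error_var, material_var).

    material_var is estimated by subtraction and floored at zero, because
    sampling noise can make the raw difference slightly negative.
    """
    total_var = variance(casework_values)   # material variation + measurement error
    error_var = variance(replicate_values)  # measurement error only
    material_var = max(total_var - error_var, 0.0)
    return total_var, error_var, material_var

# Hypothetical readings in arbitrary units.
total, error, material = split_variance([1, 3, 5, 7, 9], [4, 5, 6])
print(f"share of spread due to measurement error: {error / total:.0%}")
```

If the error share is high, effort spent on the instrument or protocol is likely to pay off; if it is low, the material itself sets the limit.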

Precision and accuracy as distinct properties

Precision and accuracy are often confused in casual language but they are operationally independent. Precision describes the spread of replicate measurements around their own mean. Accuracy describes the distance of that mean from the true value. The classic illustration is a target: a precise shooter puts all shots in a tight cluster; an accurate shooter puts shots close to the centre; an ideal shooter achieves both. A systematically miscalibrated instrument produces a tight cluster far from the true value, which is precise but inaccurate.

In forensic practice, precision is quantified by running replicate measurements on the same specimen and computing the standard deviation or the coefficient of variation (the standard deviation divided by the mean, expressed as a percentage). Accuracy is assessed by measuring a certified reference material, a substance with a known accepted value, and comparing the result to that value. The discrepancy is the bias. ISO/IEC 17025, the international accreditation standard applied to forensic laboratories in most jurisdictions including those operating under guidelines from the UK Forensic Science Regulator, the US OSAC, and India's National Accreditation Board for Testing and Calibration Laboratories (NABL), requires that both be documented.
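The quantities described above can be computed directly from replicate data. The sketch below uses only the Python standard library; the function name `precision_and_bias`, the replicate readings, and the reference value are hypothetical.

```python
from statistics import mean, stdev

def precision_and_bias(replicates, reference_value):
    """Return (sd, cv_percent, bias) for replicate readings of a reference material."""
    m = mean(replicates)
    sd = stdev(replicates)        # precision: sample standard deviation
    cv_percent = 100.0 * sd / m   # coefficient of variation, as a percentage
    bias = m - reference_value    # accuracy: mean minus the accepted value
    return sd, cv_percent, bias

# Hypothetical replicate readings of a reference material with accepted value 9.5.
sd, cv, bias = precision_and_bias([9.0, 10.0, 11.0], reference_value=9.5)
print(f"SD={sd:.2f}  CV={cv:.1f}%  bias={bias:+.2f}")
```

A small SD with a large bias is the precise-but-inaccurate case; a large SD with a bias near zero is the accurate-but-imprecise case.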

The practical consequence for evidence interpretation is this: if a method has a known bias, the reported value should be corrected before comparison. If a method has low precision, the uncertainty interval around each measurement is wide, and small numerical differences between two specimens may fall within that interval and be uninformative. Courts in England and Wales, under the guidance of the Forensic Science Regulator, and US federal courts under Daubert and Rule 702, increasingly require that the known or potential error rate of a method be disclosed. The same expectation is emerging in Indian courts, particularly following judicial commentary on forensic method reliability in evidence law proceedings under the Bharatiya Sakshya Adhiniyam 2023.

Repeatability and reproducibility

Repeatability and reproducibility are formally defined by the International Organization for Standardization in ISO 5725. Repeatability is the closeness of agreement between successive results obtained with the same method on identical test material, under the same conditions: same operator, same apparatus, same laboratory, over a short period. Reproducibility is the same closeness of agreement when conditions are changed: different operator, different apparatus, or different laboratory.

In practice, repeatability is measured by a single analyst running multiple analyses of the same specimen on the same day, sometimes called an intra-run or within-day study. Reproducibility is measured across days, analysts, or laboratories, sometimes called an inter-run or between-day study, or a collaborative trial when multiple laboratories participate. The spread observed under reproducibility conditions is always equal to or greater than the spread under within-laboratory repeatability conditions, because it adds analyst-to-analyst and laboratory-to-laboratory sources of variance on top of the within-analyst sources.
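The relationship between the two figures can be sketched for a balanced collaborative trial, in which every laboratory reports the same number of results on the same material. The calculation follows the usual one-way variance-component approach (repeatability variance plus between-laboratory variance); the function name and the data are hypothetical, and outlier screening is deliberately left out.

```python
import math
from statistics import mean, variance

def repeatability_reproducibility(lab_results):
    """Return (s_r, s_R) from a balanced set of per-laboratory results.

    lab_results: list of equal-length lists, one list per laboratory.
    """
    n = len(lab_results[0])
    if any(len(lab) != n for lab in lab_results):
        raise ValueError("balanced design required: equal results per laboratory")
    # Repeatability variance: average of the within-laboratory variances.
    s_r2 = mean([variance(lab) for lab in lab_results])
    # Between-laboratory variance: variance of lab means minus the part that
    # repeatability alone explains, floored at zero.
    lab_means = [mean(lab) for lab in lab_results]
    s_L2 = max(variance(lab_means) - s_r2 / n, 0.0)
    # Reproducibility variance adds the two components.
    s_R2 = s_r2 + s_L2
    return math.sqrt(s_r2), math.sqrt(s_R2)

# Hypothetical results from three laboratories, three replicates each.
s_r, s_R = repeatability_reproducibility([[9, 10, 11], [19, 20, 21], [29, 30, 31]])
print(f"repeatability SD={s_r:.2f}  reproducibility SD={s_R:.2f}")
```

Because the between-laboratory component is floored at zero, the reproducibility standard deviation returned here can never fall below the repeatability standard deviation.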

For forensic casework, the reproducibility figure is more relevant than repeatability, because the measurement being compared to a reference may have been made in a different laboratory, by a different analyst, or on a different day. If a scientist compares a measurement from a questioned specimen made today with a reference value measured six months ago, the relevant uncertainty is the reproducibility of the method, not its short-term repeatability. Using repeatability figures when reproducibility applies is a common source of overconfidence in forensic comparisons.

The measurement interval and the comparison problem

When a forensic analyst compares two measurements, one from a questioned specimen and one from a reference specimen, the question is whether the numerical difference between them is consistent with a common origin or inconsistent with it. Three regions exist. If the difference is less than the measurement error of the method, the comparison is uninformative: the difference cannot be distinguished from noise. If the difference exceeds measurement error but is still within the within-class variability of the relevant population, the result is consistent with common origin but cannot exclude a different origin either. If the difference exceeds the within-class variability, the result is inconsistent with common origin.

This three-zone model is a simplification, and formal evidence evaluation using likelihood ratios, covered in the topic on the role of statistics in evidence evaluation, handles the continuum more rigorously. But the three-zone framing is useful for understanding why analysts must know the measurement error floor before making any claim about a difference. A difference of 0.0002 in refractive index is distinguishable from noise if the method's measurement error is 0.00005; with a within-class spread for float glass of 0.0003, it then falls in the middle zone, consistent with common origin but unable to exclude a different origin. The same difference is uninformative if the method's measurement error is 0.0003.
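The three-zone reasoning can be written as a small decision function. This is a sketch of the simplified model only, not a substitute for likelihood-ratio evaluation; the function name and the treatment of values exactly on a boundary are assumptions made for illustration.

```python
def classify_difference(difference, measurement_error, within_class_spread):
    """Place a difference between two measurements into one of three zones."""
    d = abs(difference)
    if d <= measurement_error:
        # Cannot be distinguished from measurement noise.
        return "uninformative"
    if d <= within_class_spread:
        # Larger than noise, but within what one source can produce.
        return "consistent with common origin (different origin not excluded)"
    return "inconsistent with common origin"

# The refractive index figures used in the text.
print(classify_difference(0.0002, 0.00005, 0.0003))
print(classify_difference(0.0002, 0.0003, 0.0003))
```

The same numerical difference lands in different zones depending on the method's error floor, which is the point of the worked figures in the text.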

Several forensic disciplines have historically operated without formal quantification of either measurement error or within-class variability for the populations relevant to their casework. Pattern comparison disciplines, such as footwear mark examination and toolmark analysis, have faced particular scrutiny from the US National Commission on Forensic Science and from reports by the National Academy of Sciences (2009) and the President's Council of Advisors on Science and Technology (2016), both of which called for empirical studies of within-class and between-class variability before comparison conclusions are presented in court. The same concern applies in other common-law systems, including those of the UK, Australia, and India.

Validation studies and accreditation requirements

A method validation study characterises how a measurement method performs before it enters casework use. For quantitative methods, validation must include: measurement of precision (repeatability and reproducibility), measurement of accuracy (bias relative to a reference), determination of the limit of detection and the limit of quantification where applicable, and characterisation of the range over which the method gives reliable results. These data constitute the evidence base that supports the method's use in casework.

Accreditation to ISO/IEC 17025 is required for forensic science providers in England and Wales, mandated by the Forensic Science Regulator Act 2021. In the United States, the OSAC (Organization of Scientific Area Committees for Forensic Science) publishes standards for method validation across disciplines. In India, the NABL accredits forensic laboratories to ISO/IEC 17025 under the Department for Promotion of Industry and Internal Trade; accreditation is not yet universally mandated for all court-reporting laboratories, though the Bharatiya Sakshya Adhiniyam 2023 and associated rules are increasing judicial scrutiny of the scientific basis of forensic conclusions. In the European Union, the European Network of Forensic Science Institutes (ENFSI) guidelines align with ISO/IEC 17025 and require documented validation for all quantitative methods.

Ongoing quality control within a laboratory supplements the initial validation. Control charts track whether a method's performance drifts over time. Reference materials are run alongside casework specimens to detect instrument miscalibration before it affects reported results. Proficiency testing, in which analysts analyse blind specimens with known true values, provides an external check on both precision and accuracy. The cumulative record of quality control data is the operational evidence that a method's validated performance has been maintained throughout the period when casework was conducted.
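A control chart of the kind mentioned above can be sketched as follows. The example assumes a simple Shewhart-style chart with limits at the baseline mean plus or minus three standard deviations; the function names, the multiplier `k`, and the readings are hypothetical, and a laboratory's own quality procedures would define the actual rules.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (lower, centre, upper) limits from baseline reference-material readings."""
    centre = mean(baseline)
    sd = stdev(baseline)
    return centre - k * sd, centre, centre + k * sd

def out_of_control(values, limits):
    """Return the indices of readings that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical baseline readings from validation, then later daily checks.
limits = control_limits([9.0, 10.0, 11.0])
flagged = out_of_control([10.0, 13.5, 6.0, 12.0], limits)
print(f"limits={limits}  flagged indices={flagged}")
```

Flagged readings would prompt investigation of the instrument before casework results from the same period are reported.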

Check your understanding

A forensic scientist measures the diameter of fibres from a suspect's jumper and from fibres found at a scene. Both sets come back with a mean diameter of 18.4 micrometres, but the scene fibres show more spread. What does the greater spread in the scene fibres most likely indicate?

Key Takeaways

  • Within-class variability is the natural spread among items of the same origin; between-class variability is the spread between different origins. A method discriminates best when the ratio of between-class to within-class variability is large.
  • Measurement error and natural variability are separate components of total observed spread, and must be estimated independently: error from replication of a single specimen, and natural variability from measurements across many specimens from the same class.
  • Precision (low scatter among replicates) and accuracy (closeness to the true value) are independent properties; a biased but repeatable instrument is precise without being accurate, and both must be documented in method validation.
  • Repeatability covers same-analyst, same-day variation; reproducibility covers changed conditions across analysts or laboratories. The reproducibility figure, not repeatability, is relevant when comparing measurements made in different contexts.
  • A numerical difference between two forensic specimens is interpretable only after comparing it to the measurement error of the method and the within-class variability of the relevant population; without both reference values, no comparison conclusion is defensible.
What is the difference between within-class and between-class variability in forensic science?
Within-class variability is the natural spread of measurements among items that belong to the same source or category, for example the range of refractive index values across fragments from a single broken pane. Between-class variability is the spread between different sources or categories. Evidence is informative when between-class variability is large relative to within-class variability, because that means two measurements from the same source will tend to be closer together than two measurements from different sources.
What is measurement error and how does it differ from natural variability?
Measurement error is the discrepancy between a recorded value and the true value of a property, arising from the instrument, the analyst, or the procedure. Natural variability is an inherent property of the material or population being studied. The two are separate contributions to the total observed spread in a dataset. A precise instrument can reveal true natural variability clearly; an imprecise instrument blurs it. Characterising both separately is a prerequisite for sound evidence interpretation.
What is the difference between precision and accuracy?
Precision is the degree to which repeated measurements of the same specimen agree with each other, regardless of whether they are close to the true value. Accuracy is the degree to which a measurement agrees with the accepted true or reference value. An instrument can be precise but inaccurate (consistently biased), accurate but imprecise (centred on the true value but with wide scatter), or both. Forensic validation studies must evaluate both properties independently.
What is repeatability and how does it relate to reproducibility?
Repeatability is the agreement between successive measurements made on the same specimen under the same conditions: same instrument, same analyst, same laboratory, over a short time. Reproducibility is the agreement between measurements made under changed conditions, such as a different analyst, laboratory, or day. Repeatability sets a lower bound on variability; reproducibility is typically larger and reflects how results will differ across casework scenarios.
Why must measurement error be quantified before comparing two forensic measurements?
If the difference between two measurements is smaller than the measurement error of the method, it cannot be concluded that the two specimens differ. Conversely, if the difference exceeds what would be expected from error alone, the analyst must decide whether it exceeds what would also be expected from within-class natural variability. Without knowing the magnitude of error, neither comparison is defensible, and courts in the US, UK, and India have increasingly required that error rates be documented for forensic methods.
