Stylometry and Statistical Distance Methods

Stylometry translates authorial habit into numbers and measures the distance between writing samples. This topic covers Burrows's Delta and its variants, multivariate visualisation methods, machine-learning classifiers, and the validation requirements that determine whether stylometric evidence is legally defensible.

Last updated: 19 Jun 2026

Stylometry measures authorship by converting writing samples into frequency vectors and computing the mathematical distance between them. The dominant forensic tool is Burrows's Delta, introduced in 2002, which standardises word frequencies using z-scores before comparing texts so that no single word dominates by volume alone. Variant metrics (Cosine Delta, Eder's Delta, Argamon's Delta) extend this framework to handle short texts, noisy corpora, and non-normal frequency distributions. Admissibility depends not on accuracy in the abstract but on whether the error rate has been estimated honestly on held-out data and whether the method is transparent enough to withstand cross-examination.

Stylometry converts the intuition that writing styles are distinguishable into a measurable quantity. It replaces the subjective impression of 'this sounds like X' with a distance score that a statistician can interpret and a court can scrutinise. The field has a longer history than most people realise: one of the earliest quantitative stylometry studies dates to 1887, when T. C. Mendenhall counted word-length distributions in Dickens, Thackeray, and other authors. His comparison of Shakespeare and Marlowe appeared in a later 1901 article in Popular Science Monthly. Modern tools are far more powerful, but the core operation has not changed: measure the text, compute a distance, compare to a reference.

This topic focuses on the statistical machinery: Burrows's Delta and its descendants, principal component analysis (PCA) for visualising authorship space, rolling Delta for detecting collaboration and revision, and the machine-learning classifiers that are increasingly applied to attribution. Each method has a different relationship with interpretability, and interpretability is exactly what separates a useful forensic tool from an inadmissible black box.

Running underneath all of it is the validation question. A method that works on a specific historical corpus may fail on a new genre, a new language, or a new time period. The closed-corpus trap, where methods are validated only on data they were also trained on, is a persistent problem in published stylometric research, and it is the first thing a competent expert witness should address when their methodology is challenged. This topic covers both the tools and the honest accounting of where they can go wrong.

By the end of this topic you will be able to:

Explain how Burrows's Delta standardises word frequencies and why z-score normalisation makes the metric more stable than raw counts.
Compare the conditions under which Cosine Delta, Eder's Delta, and Argamon's Delta outperform the original metric.
Describe how rolling Delta is used to detect collaborative authorship or revision within a single document.
Distinguish the admissibility advantages of Delta-based methods from machine-learning classifiers under Daubert and comparable reliability standards.
Identify what the closed-corpus trap is and apply the correct cross-validation strategy to avoid it in a forensic stylometric report.

Key terms

Burrows's Delta: A distance metric introduced by John Burrows in 2002 that measures how far apart two texts' word-frequency profiles are once each word's frequency has been standardised (z-scored) against a reference corpus. Smaller Delta = closer style.
Cosine Delta: A variant of Burrows's Delta that uses cosine similarity instead of Manhattan distance. Research by Smith and Aldridge (2011) and later Eder, Rybicki, and Kestemont has shown it often outperforms the original Delta on shorter texts and noisy corpora.
Principal component analysis (PCA): A dimensionality-reduction technique that projects a high-dimensional feature space onto a small number of axes capturing maximum variance. Used to visualise clusters of texts in authorship space.
Rolling Delta: Application of the Delta metric to a moving window of text, producing a curve showing how stylistic similarity to each candidate changes across the document. Used for collaboration and revision detection.
Cross-validation: A validation strategy that repeatedly partitions a labelled corpus into training and test subsets to estimate the method's error rate on unseen data. A key safeguard against over-fitting.
Closed-corpus trap: The error of validating a stylometric method exclusively on texts from authors whose profiles were used to build the model, inflating apparent accuracy beyond what is achievable on truly unseen authors.

Burrows's Delta and its variants

John Burrows introduced Delta in a 2002 methodological paper, 'Delta: A Measure of Stylistic Difference and a Guide to Likely Authorship', published in Literary and Linguistic Computing. The algorithm is transparent enough to explain in a few sentences. Take a large reference corpus and extract the n most frequent words (typically 100-500). For each word, compute its mean frequency and standard deviation across the corpus. Standardise each text's word frequencies by subtracting the mean and dividing by the standard deviation, producing a z-score. Delta is then the average absolute difference in z-scores across the feature words, between the questioned text and each candidate. The candidate with the smallest Delta is the closest stylistic match.

The z-score standardisation is what gives Delta its stability. Without it, common words dominate simply because they are common. After standardisation, a common word used one standard deviation above its corpus mean contributes the same to the distance as a rare word used one standard deviation above its own mean. Each feature contributes to the distance in proportion to its deviation from the corpus norm, regardless of its raw frequency.
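The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a forensic-grade implementation: the five toy texts, whitespace tokenisation, the use of population standard deviation, and the choice to standardise over all texts (including the questioned one) are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, pstdev

# Toy corpus (hypothetical): two texts per candidate plus a questioned text Q.
texts = {
    "A1": "the cat sat on the mat and the dog sat by the door",
    "A2": "the bird sat on the wall and the cat sat by the fire",
    "B1": "a cat of mine is in a box of hay in a barn",
    "B2": "a dog of his is in a bed of straw in a shed",
    "Q": "the fox sat on the hill and an owl ran by the tree in a wood",
}

def top_words(corpus, n):
    """The n most frequent words across the whole corpus."""
    counts = Counter(tok for t in corpus.values() for tok in t.lower().split())
    return [w for w, _ in counts.most_common(n)]

def relative_freqs(text, vocab):
    """Relative frequency of each feature word in one text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def z_scores(profiles):
    """Standardise each feature (column) across all texts."""
    cols = list(zip(*profiles))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in profiles]

def burrows_delta(za, zb):
    """Mean absolute difference between two z-score profiles."""
    return mean(abs(a - b) for a, b in zip(za, zb))

vocab = top_words(texts, 5)
names = list(texts)
z = dict(zip(names, z_scores([relative_freqs(texts[n], vocab) for n in names])))

# Delta from the questioned text to every reference text; smaller = closer.
deltas = {n: burrows_delta(z["Q"], z[n]) for n in names if n != "Q"}
closest = min(deltas, key=deltas.get)
print(vocab, closest)
```

On real material the feature list would run to 100-500 words and each candidate would be represented by far more text than a single sentence.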

Delta variant	Key change from Burrows	Typical advantage
Burrows's Delta (original)	Manhattan distance on z-scores	Stable baseline; well-replicated
Argamon's Delta	Quadratic (Euclidean) distance on z-scores	Penalises large deviations more; better on some corpora
Cosine Delta	Cosine similarity on z-scored vectors	Outperforms on short or noisy texts
Eder's Delta	L1 distance using corpus proportions rather than z-scores	Better performance on non-normal frequency distributions

No single variant dominates across all conditions. Published comparative studies suggest Cosine Delta performs best on short texts and Eder's Delta handles non-Gaussian distributions well. The practical advice for forensic work is to run multiple variants and treat strong cross-variant agreement as a sign of a reliable result. Where variants disagree, that disagreement is evidence to report, not to hide.
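A cross-variant agreement check might look like the sketch below. It assumes the texts have already been converted to z-score vectors (the numbers are invented for illustration), and the scaling used for the quadratic variant is one reasonable choice rather than a fixed standard. Eder's Delta is left out because its weighting is not spelled out here.

```python
import math

def burrows(za, zb):
    """Manhattan distance on z-scores, averaged over features."""
    return sum(abs(a - b) for a, b in zip(za, zb)) / len(za)

def argamon_quadratic(za, zb):
    """Euclidean-style distance on z-scores (scaling is an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(za, zb)) / len(za))

def cosine_delta(za, zb):
    """One minus the cosine similarity of the two z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm_a = math.sqrt(sum(a * a for a in za))
    norm_b = math.sqrt(sum(b * b for b in zb))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical z-score profiles over four feature words.
questioned = [1.2, -0.5, 0.3, -1.0]
candidates = {
    "A": [1.0, -0.4, 0.5, -0.8],
    "B": [-0.9, 0.7, -0.2, 1.1],
}

variants = {
    "burrows": burrows,
    "argamon": argamon_quadratic,
    "cosine": cosine_delta,
}

# For each variant, which candidate is closest to the questioned text?
winners = {
    name: min(candidates, key=lambda c: fn(questioned, candidates[c]))
    for name, fn in variants.items()
}
variants_agree = len(set(winners.values())) == 1
print(winners, variants_agree)
```

If `variants_agree` were false, the per-variant winners are what belongs in the report.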

Burrows's Delta pipeline: raw word-frequency counts are z-score normalised so each feature contributes proportionally to its deviation, then Manhattan distances are summed and averaged across all features to rank candidates. The candidate with the smallest Delta is the closest stylistic match.

PCA and cluster analysis for visualising authorship

Delta scores are numbers. PCA turns the frequency profiles behind them into a picture. Principal component analysis takes the high-dimensional feature vectors for a set of texts and projects them onto two or three axes that capture the most variance in the data. When the method works, texts by the same author cluster together and texts by different authors sit apart. A plot of the first two principal components is standard in published stylometric studies because it shows the authorship structure of a corpus at a glance.

Cluster analysis is a complementary approach. Hierarchical clustering groups texts by similarity and displays the result as a dendrogram, a tree where closely related texts branch off together. Both PCA plots and dendrograms are valuable in court because they make the distance structure visible to a non-specialist audience. They do not, by themselves, assign authorship; they show where the questioned document sits relative to candidates in a feature space the expert has constructed.
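To make the projection concrete, here is a small dependency-free sketch of PCA using power iteration. The five three-feature vectors are invented, and in practice an analyst would more likely call an established numerical library; the hand-rolled version is only meant to show what the projection does.

```python
import math

def centre(rows):
    """Subtract each column's mean so the cloud is centred on the origin."""
    mus = [sum(c) / len(c) for c in zip(*rows)]
    return [[x - m for x, m in zip(r, mus)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_axis(cov, iters=500):
    """Power iteration: (eigenvalue, unit vector) for the direction of largest variance."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-uniform start
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v

def pca_2d(rows):
    """Project rows onto the first two principal components."""
    X = centre(rows)
    cov = covariance(X)
    lam1, v1 = leading_axis(cov)
    d = len(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    _, v2 = leading_axis(deflated)
    dot = lambda r, v: sum(a * b for a, b in zip(r, v))
    return [(dot(r, v1), dot(r, v2)) for r in X]

# Hypothetical z-score profiles: two texts per candidate plus questioned text Q.
labels = ["A1", "A2", "B1", "B2", "Q"]
vectors = [
    [1.0, 0.9, -1.0],
    [0.8, 1.1, -0.9],
    [-1.0, -0.8, 1.1],
    [-0.9, -1.0, 0.8],
    [0.7, 0.8, -0.7],
]
coords = dict(zip(labels, pca_2d(vectors)))
for name, (pc1, pc2) in coords.items():
    print(f"{name}: PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

The sign of each axis is arbitrary, so only relative positions matter: in this toy data Q lands on the same side of the first component as the A texts.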

Schematic PCA scatterplot: three author clusters with questioned document Q near Candidate A.

Rolling Delta and moving-window methods

A single Delta score for a document treats the entire text as a uniform block. Rolling Delta relaxes that assumption. The method slides a window of fixed size (commonly 1,000-5,000 words) across the text and computes the Delta to each candidate at each window position. The output is a graph: the x-axis is the position in the document; the y-axis is the Delta distance to each candidate. Where one candidate's line drops while another's rises, the method is flagging a stylistic shift.
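A moving-window version can be sketched as follows. Everything here is scaled down for illustration: the feature list, the candidate samples, and the 13-token window are hypothetical, whereas real windows run to thousands of words and usually overlap (step smaller than window).

```python
from collections import Counter
from statistics import mean, pstdev

VOCAB = ["the", "a", "of", "in", "sat"]  # hypothetical feature words

def profile(tokens):
    """Relative frequency of each feature word in a token list."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in VOCAB]

def standardise(profiles):
    """Z-score each feature across all supplied profiles."""
    cols = list(zip(*profiles))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(p, mus, sds)] for p in profiles]

def delta(za, zb):
    return mean(abs(a - b) for a, b in zip(za, zb))

def rolling_delta(doc_tokens, candidates, window, step):
    """Return [(window_start, {candidate: delta}), ...] across the document."""
    starts = range(0, len(doc_tokens) - window + 1, step)
    names = list(candidates)
    profiles = [profile(candidates[n].split()) for n in names]
    profiles += [profile(doc_tokens[s:s + window]) for s in starts]
    z = standardise(profiles)
    z_cands, z_windows = z[:len(names)], z[len(names):]
    return [(s, {n: delta(zw, zc) for n, zc in zip(names, z_cands)})
            for s, zw in zip(starts, z_windows)]

candidates = {
    "A": "the cat sat on the mat and the dog sat by the door",
    "B": "a cat of mine is in a box of hay in a barn",
}
# Toy document: first half in A's manner, second half in B's.
part1 = "the fox sat on the hill and the owl sat by the tree"
part2 = "a hen of ours is in a coop of wire in a yard"
doc = (part1 + " ") * 2 + (part2 + " ") * 2

curve = rolling_delta(doc.split(), candidates, window=13, step=13)
closest = [min(d, key=d.get) for _, d in curve]
print(closest)
```

Plotting each candidate's values from `curve` against the window start position gives the graph described above; the point where `closest` changes marks the candidate boundary.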

This is useful for two forensic scenarios. The first is collaborative authorship: a document attributed to one person may turn out to show a clear stylistic boundary mid-way through, consistent with a second author drafting the later sections. The second is revision or ghost-writing: a base text written by one person and substantially revised by another can show a distinctive layering of styles that a single global score would obscure. In literary scholarship, rolling Delta was used to probe the joint authorship of plays attributed to Shakespeare and Fletcher.

Machine-learning classifiers and admissibility

Support vector machines (SVMs), random forests, and neural networks often outperform Delta methods on laboratory benchmarks, particularly on large candidate sets. The question for forensic application is not accuracy in the abstract; it is whether the accuracy estimate is reliable for the specific case conditions (text length, number of candidates, genre, language), and whether the method is interpretable enough for a court to scrutinise.

Method	Accuracy (typical large-scale benchmark)	Interpretability	Forensic deployment status
Burrows's Delta	70-85%	High: distances between transparent feature vectors	Widely used in practice and case reports
SVM (linear kernel)	80-92%	Moderate: feature weights are inspectable	Used in some forensic casework with feature-weight reporting
Random forest	82-90%	Moderate: permutation importance scores available	Used in research; limited forensic case deployment
Neural network (LSTM/transformer)	88-95%	Low: attention maps are approximate, not definitive	Research stage; very limited forensic use due to interpretability gap

The admissibility debate is live in multiple jurisdictions. U.S. federal courts applying the Daubert standard consider whether a method is testable, has a known error rate, has been peer-reviewed, and is generally accepted in the relevant scientific community. These are factors a judge weighs rather than a fixed checklist, and Delta satisfies them better than most neural approaches. UK courts under the reliability standard established in R v Dlugosz [2013] EWCA Crim 2 similarly weigh method transparency heavily. The practical position most forensic linguists adopt is: use machine-learning methods to form an initial view, then confirm with a Delta-based method that can be explained transparently in the report and withstand cross-examination.

Validation: error rates and the closed-corpus trap

Stylometric methods can produce impressive numbers in controlled settings. The question is whether those numbers are honest estimates of performance on the kind of text the forensic case involves. Two practices undermine honest validation more than any others: over-fitting to a training corpus, and validating on data that overlaps with training data.

The closed-corpus trap is specifically the problem of validating a method only on texts from the same set of authors used to build the feature model. In that setting, the method has, in effect, seen the answer before the test. Reliable validation requires held-out authors, texts written at different times or in different registers, and ideally comparison against a population of texts from writers not in the candidate set, to establish the rate at which the method assigns a questioned text to the wrong person when the true author is absent.

k-fold cross-validation
Divide the labelled corpus into k equal folds. Train on k-1 folds and test on the held-out fold. Repeat k times and average the error rates. For forensic purposes, k=10 is a common choice.
Leave-one-out cross-validation
A special case where k equals the number of texts. Each text is tested exactly once as a held-out item. Maximises the training data used but is computationally expensive on large corpora.
Impostors method
A specific open-set validation technique that introduces 'impostor' texts from outside the candidate set. The method must reject these as not matching any candidate. This directly tests the open-set problem that real forensic cases face.
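Leave-one-out cross-validation can be sketched with a nearest-centroid Delta classifier. The six feature vectors are invented, and the classifier design (author centroids, scaling re-fitted on the training texts in every fold so the held-out text never influences the standardisation) is one plausible set-up rather than a prescribed one.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Per-feature mean and standard deviation from the training texts only."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]

def transform(row, mus, sds):
    return [(x - m) / s for x, m, s in zip(row, mus, sds)]

def delta(za, zb):
    return mean(abs(a - b) for a, b in zip(za, zb))

def centroid(rows):
    return [mean(c) for c in zip(*rows)]

def loo_error_rate(samples):
    """samples: list of (author, relative-frequency vector) pairs."""
    errors = 0
    for i, (true_author, vec) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]          # hold one text out
        mus, sds = fit_scaler([v for _, v in train])   # no leakage from held-out text
        by_author = {}
        for author, v in train:
            by_author.setdefault(author, []).append(transform(v, mus, sds))
        centroids = {a: centroid(vs) for a, vs in by_author.items()}
        z_held_out = transform(vec, mus, sds)
        guess = min(centroids, key=lambda a: delta(z_held_out, centroids[a]))
        errors += guess != true_author
    return errors / len(samples)

# Hypothetical relative frequencies of three feature words per text.
samples = [
    ("A", [0.30, 0.05, 0.10]),
    ("A", [0.28, 0.06, 0.12]),
    ("A", [0.31, 0.04, 0.09]),
    ("B", [0.10, 0.20, 0.15]),
    ("B", [0.12, 0.22, 0.14]),
    ("B", [0.09, 0.19, 0.16]),
]
error_rate = loo_error_rate(samples)
print(f"leave-one-out error rate: {error_rate:.2f}")
```

Note that every held-out text here still comes from an author the model has seen, so the figure says nothing about the open-set case. Testing that requires holding out whole authors or adding impostor texts.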

Worked example

Stylometric analysis of a disputed corporate document

Applying Delta, PCA, and cross-validation to a short authorship dispute.

A disputed internal report at a financial institution is attributed by opposing parties to one of two senior analysts, and the question of authorship has legal implications for a compliance hearing. Each analyst has an archive of authored reports totalling roughly 15,000 words, and the disputed report is 2,800 words.

Feature construction: 200 most frequent words across the combined 30,000-word reference corpus, yielding 200 function and near-function words.
Delta comparison: Cosine Delta between the disputed report and each analyst's corpus centroid. Analyst B shows a substantially smaller distance (0.42 versus 0.91 for Analyst A).
PCA visualisation: projection of all reference segments and the disputed report onto the first two principal components. The disputed report sits within the cluster of Analyst B's documents.
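Cross-validation: leave-one-out CV on 140 reference segments (70 per analyst, averaging about 214 words each) gives 91 per cent correct attribution. Because these segments are much shorter than the 2,800-word disputed report, the figure describes performance at segment length, and the report says so. The 9 per cent error rate is explicitly stated in the report.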
Conclusion reported: the disputed document is more consistent with Analyst B's writing style than with Analyst A's across multiple analyses. The estimated error rate of 9 per cent leaves meaningful residual uncertainty: the result supports the attribution but does not settle it.

The agreement between the Delta and PCA results, backed by a cross-validated error rate, makes the conclusion more defensible than any single metric alone. Stating the error rate keeps the report honest and easier to defend under cross-examination. This is the model a well-executed stylometric analysis follows.
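One way to express how firmly the 9 per cent figure is known is to put an interval around it. The sketch below uses the Wilson score interval, which is an assumption of this example rather than part of the analysis described above. It also treats the 140 segments as independent trials, which segments drawn from the same two archives are not, so the true uncertainty is probably wider.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95 per cent)."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# Figures from the worked example: 9 per cent error on 140 segments.
low, high = wilson_interval(0.09, 140)
print(f"estimated error rate 9%, interval {low:.1%} to {high:.1%}")
```

With these inputs the interval runs from roughly 5 to 15 per cent, which is a more candid statement of the evidence than the point estimate alone.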

Check your understanding

What does a smaller Burrows's Delta score between a questioned text and a candidate indicate?

Key Takeaways

Burrows's Delta standardises word frequencies by z-score before computing distance, so each feature contributes in proportion to its deviation from the corpus norm rather than its raw frequency.
Variant metrics (Cosine Delta, Eder's Delta, Argamon's Delta) outperform the original in specific conditions; running multiple variants and checking agreement is more defensible than relying on a single metric.
PCA and hierarchical clustering visualise the authorship structure of a corpus, making it possible to show a court where the questioned document sits relative to candidates without requiring statistical expertise.
Rolling Delta detects stylistic shifts within a single document, supporting collaboration or revision hypotheses that a global attribution score cannot reveal.
Machine-learning classifiers often achieve higher benchmark accuracy than Delta but are harder to explain in court; transparent Delta methods remain the usual choice for evidential reporting.
Honest validation requires cross-validation on held-out authors and texts; methods validated only on their own training authors (the closed-corpus trap) overstate real-world performance.

What is Burrows's Delta and how does it measure authorship?

Burrows's Delta is a distance metric that standardises each text's word frequencies against the mean and standard deviation of a reference corpus, then averages the absolute differences between two texts' standardised profiles. A smaller Delta score between a questioned document and a candidate's texts means the documents are closer in stylistic space. The method was introduced by John Burrows in 2002 and has become one of the most widely replicated attribution tools in literary and forensic stylometry.

What is rolling Delta used for?

Rolling Delta applies the Delta distance calculation to a moving window of text, producing a graph of stylistic similarity over the course of a document. This is useful for detecting collaboration, where one author's voice gives way to another's mid-text, or revision, where a later editor's style can be distinguished from the original.

Why are machine-learning classifiers harder to admit as evidence than Delta methods?

Neural networks and ensemble methods are often opaque: they do not produce an interpretable account of which features drove the decision. This 'black box' quality makes them difficult to cross-examine in court. Delta-based methods, by contrast, produce a distance between transparent feature vectors that an expert can walk a court through step by step.

What is the closed-corpus trap in stylometry validation?

The closed-corpus trap occurs when a method is validated only on texts from authors already in the training data. The method may perform well because it has essentially memorised the corpus rather than learning generalisable style distinctions. A properly validated stylometric tool is tested on authors and texts not seen during training.

What does cross-validation tell you about a stylometric method?

Cross-validation repeatedly holds out a subset of the known texts, trains the model on the rest, and tests whether it correctly identifies the held-out texts. The average error rate across folds is an honest estimate of the method's performance on unseen text from the same population, and it is the figure that should accompany any forensic attribution claim.
