Stylometry translates authorial habit into numbers and measures the distance between writing samples. This topic covers Burrows's Delta and its variants, multivariate visualisation methods, machine-learning classifiers, and the validation requirements that determine whether stylometric evidence is legally defensible.
You can recognise a friend's writing style even in a short text. Stylometry is the attempt to do the same thing systematically, at scale, with documented methods and quantified uncertainty. It turns the subjective impression of 'this sounds like X' into a distance score that a statistician can interpret and a court can scrutinise. The field has a longer history than most people realise: quantitative stylometry dates back to at least 1887, when T. C. Mendenhall published word-length distributions for writers including Dickens and Thackeray; in 1901 he applied the same method to Shakespeare, Bacon, and Marlowe. Modern tools are far more powerful, but the core operation (measure the text, compute a distance, compare to a reference) has not changed.
This topic focuses on the statistical machinery: Burrows's Delta and its descendants, principal component analysis (PCA) for visualising authorship space, rolling Delta for detecting collaboration and revision, and the machine-learning classifiers that are increasingly applied to attribution. Each method has a different relationship with interpretability, and interpretability is exactly what separates a useful forensic tool from an inadmissible black box.
Running underneath all of it is the validation question. A method that works on a specific historical corpus may fail on a new genre, a new language, or a new time period. The closed-corpus trap, where methods are validated only on data they were also trained on, is a persistent problem in published stylometric research, and it is the first thing a competent expert witness should address when their methodology is challenged. This topic covers both the tools and the honest accounting of where they can go wrong.
A simple idea, a stable performance record, and a family of refinements.
John Burrows introduced Delta in a 2002 paper, testing it on a corpus of English Restoration verse. The algorithm is transparent enough to explain in a few sentences. Take a large reference corpus and extract the n most frequent words (typically 100-500). For each word, compute the mean and standard deviation of its relative frequency across the texts in the corpus. Standardise each text's word frequencies by subtracting the mean and dividing by the standard deviation, producing a z-score. Delta is then the average absolute difference in z-scores across the feature words, computed between the questioned text and each candidate. The candidate with the smallest Delta is the closest stylistic match.
The z-score standardisation is what gives Delta its stability. Without it, the most common words dominate the distance simply because their raw frequencies, and therefore their raw differences, are largest. After standardisation, a word that sits one standard deviation above its corpus mean contributes the same amount to the distance whether it is very common or comparatively rare. Every feature gets a vote proportional to how unusual its frequency is, measured against that word's own variability.
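The steps described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not a reference implementation: the function names, the whitespace tokeniser, and the use of the population standard deviation are all assumptions made for the example, and it needs at least two candidate texts to have any spread to standardise against.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text, vocab):
    """Relative frequency of each vocab word in a whitespace-tokenised text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def burrows_delta(questioned, candidates, n_words=100):
    """Return {candidate_name: Delta} for one questioned text.

    `candidates` maps a name to a reference text. Together the reference
    texts form the corpus that supplies the feature words, means and
    standard deviations (an assumption of this sketch).
    """
    corpus_counts = Counter()
    for text in candidates.values():
        corpus_counts.update(text.lower().split())
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]

    freqs = {name: relative_freqs(t, vocab) for name, t in candidates.items()}
    columns = list(zip(*freqs.values()))
    mu = [mean(col) for col in columns]
    sigma = [pstdev(col) for col in columns]
    # Words with zero spread cannot be standardised, so they are dropped.
    keep = [i for i, s in enumerate(sigma) if s > 0]

    def z(vec):
        return [(vec[i] - mu[i]) / sigma[i] for i in keep]

    zq = z(relative_freqs(questioned, vocab))
    # Delta: mean absolute difference between z-score profiles.
    return {
        name: mean(abs(a - b) for a, b in zip(zq, z(vec)))
        for name, vec in freqs.items()
    }
```

Calling `min(scores, key=scores.get)` on the returned dictionary gives the closest candidate. Real casework would use far longer texts and a proper tokeniser; the toy inputs in a test only show that the arithmetic behaves as described.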
| Delta variant | Key change from Burrows | Typical advantage |
|---|---|---|
| Burrows's Delta (original) | Manhattan distance on z-scores | Stable baseline; well-replicated |
| Argamon's Delta | Quadratic (Euclidean) distance on z-scores | Penalises large deviations more; better on some corpora |
| Cosine Delta | Cosine distance (one minus cosine similarity) on z-scored vectors | Outperforms on short or noisy texts |
| Eder's Delta | Manhattan distance on z-scores, weighted by frequency rank so that the most frequent words count most | Damps the noise contributed by less frequent words |
No single variant dominates across all conditions. Published comparative studies suggest Cosine Delta performs best on short texts, while Eder's Delta reduces the influence of the less frequent, less stable feature words. The practical advice for forensic work is to run multiple variants and treat strong cross-variant agreement as a sign of a reliable result. Where variants disagree, that disagreement is evidence to report, not to hide.
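To make the cross-variant check concrete, here is a minimal sketch of three of the distances from the table, applied to z-score vectors that are assumed to have been computed already. The function names and the `closest_by_variant` helper are hypothetical, introduced only for this example.

```python
import math


def manhattan_delta(za, zb):
    """Burrows's Delta: mean absolute difference of z-scores."""
    return sum(abs(a - b) for a, b in zip(za, zb)) / len(za)


def quadratic_delta(za, zb):
    """Euclidean distance on z-scores, in the spirit of Argamon's variant."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(za, zb)))


def cosine_delta(za, zb):
    """One minus the cosine of the angle between two z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm = math.sqrt(sum(a * a for a in za)) * math.sqrt(sum(b * b for b in zb))
    return 1.0 - dot / norm


VARIANTS = {
    "burrows": manhattan_delta,
    "quadratic": quadratic_delta,
    "cosine": cosine_delta,
}


def closest_by_variant(z_questioned, z_candidates):
    """Map each variant name to the candidate that variant ranks closest."""
    return {
        variant: min(
            z_candidates,
            key=lambda name: dist(z_questioned, z_candidates[name]),
        )
        for variant, dist in VARIANTS.items()
    }
```

If every value in the returned dictionary names the same candidate, the variants agree; if they differ, the report should say so and show the individual distances.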
Turning a 300-dimensional feature space into a picture a judge can follow.
Delta scores are numbers. PCA turns the feature vectors behind them into a picture. Principal component analysis takes the high-dimensional feature vectors for a set of texts and projects them onto two or three axes that capture the most variance in the data. When the stylistic signal is strong, texts by the same author tend to cluster together and texts by different authors sit apart. A plot of the first two principal components is standard in published stylometric studies because it shows the authorship structure of a corpus at a glance.
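As a rough illustration of the projection, the sketch below finds the first two principal components by power iteration on the covariance matrix, using only the standard library. In practice an established numerical library would normally be used instead; the function names, the fixed iteration count, and the seeded random starting vector are assumptions of this example.

```python
import math
import random


def centre(rows):
    """Subtract each column's mean so the cloud of texts is centred on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(matrix, iterations=500, seed=0):
    """Power iteration: leading eigenvector and eigenvalue of a symmetric matrix."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return v, eigenvalue


def pca_2d(rows):
    """Project each row (one text's feature vector) onto the first two components."""
    centred = centre(rows)
    cov = covariance(centred)
    d = len(cov)
    pc1, lam1 = top_component(cov)
    # Deflation removes the first component so the next one can be found.
    deflated = [
        [cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
        for i in range(d)
    ]
    pc2, _ = top_component(deflated)
    return [
        (sum(a * b for a, b in zip(r, pc1)), sum(a * b for a, b in zip(r, pc2)))
        for r in centred
    ]
```

The sign of each axis is arbitrary, so only the relative positions of the points carry meaning. Reporting how much variance the plotted components explain is a sensible companion to any such plot.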
Cluster analysis is a complementary approach. Hierarchical clustering groups texts by similarity and displays the result as a dendrogram, a tree where closely related texts branch off together. Both PCA plots and dendrograms are valuable in court because they make the distance structure visible to a non-specialist audience. They do not, by themselves, assign authorship; they show where the questioned document sits relative to candidates in a feature space the expert has constructed.
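The merge order that a dendrogram draws can be sketched with a small agglomerative routine. The example below assumes average linkage and a precomputed, symmetric table of pairwise distances (for instance Delta scores); both choices, and the function name, are assumptions for illustration rather than the only reasonable set-up.

```python
def average_linkage(labels, dist):
    """Agglomerative clustering with average linkage.

    `dist[a][b]` is the distance between texts a and b (symmetric).
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [(label,) for label in labels]
    merges = []

    def between(c1, c2):
        # Average of all pairwise distances between the two clusters.
        return sum(dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [
            (between(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(pairs)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

Reading the history from the first merge onwards shows which texts join early (close in style) and which join only at the end. Different linkage rules can produce different trees from the same distances, which is worth stating in a report.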
Stylometry can also read change over the course of a single text.
A single Delta score for a document treats the entire text as a uniform block. Rolling Delta relaxes that assumption. The method slides a window of fixed size (commonly 1,000-5,000 words) across the text and computes the Delta to each candidate at each window position. The output is a graph: the x-axis is the position in the document; the y-axis is the Delta distance to each candidate. Where one candidate's line drops while another's rises, the method is flagging a stylistic shift.
This is useful for two forensic scenarios. The first is collaborative authorship: a document attributed to one person may turn out to show a clear stylistic boundary mid-way through, consistent with a second author drafting the later sections. The second is revision or ghost-writing: a base text written by one person and substantially revised by another can show a distinctive layering of styles that a single global score would obscure. In literary scholarship, rolling Delta was used to probe the joint authorship of plays attributed to Shakespeare and Fletcher.
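The sliding-window procedure can be sketched as follows. This is a minimal illustration under stated assumptions: texts arrive as token lists, the feature vocabulary is supplied by the caller, the candidates' own texts provide the means and standard deviations, and the function names and default window and step sizes are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def rel_freqs(tokens, vocab):
    """Relative frequency of each vocab word in a list of tokens."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def rolling_delta(tokens, candidates, vocab, window=1000, step=250):
    """Return [(start_index, {candidate: Delta}), ...] for each window.

    `candidates` maps a name to that candidate's reference token list.
    """
    profiles = {name: rel_freqs(toks, vocab) for name, toks in candidates.items()}
    columns = list(zip(*profiles.values()))
    mu = [mean(col) for col in columns]
    sigma = [pstdev(col) for col in columns]
    # Words with no spread across candidates cannot be standardised.
    keep = [i for i, s in enumerate(sigma) if s > 0]

    def z(vec):
        return [(vec[i] - mu[i]) / sigma[i] for i in keep]

    z_cands = {name: z(vec) for name, vec in profiles.items()}
    results = []
    for start in range(0, len(tokens) - window + 1, step):
        zw = z(rel_freqs(tokens[start:start + window], vocab))
        deltas = {
            name: mean(abs(a - b) for a, b in zip(zw, zc))
            for name, zc in z_cands.items()
        }
        results.append((start, deltas))
    return results
```

Plotting each candidate's Delta against `start_index` gives the graph described above. Smaller windows localise a boundary more precisely but make each score noisier, so the window size should be reported alongside the result.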
Better accuracy, harder to explain, and that trade-off defines the forensic debate.
Support vector machines (SVMs), random forests, and neural networks often outperform Delta methods on laboratory benchmarks, sometimes by a wide margin on large candidate sets. The question for forensic application is not accuracy in the abstract; it is whether the accuracy estimate is reliable for the specific case conditions (text length, number of candidates, genre, language), and whether the method is interpretable enough for a court to scrutinise.
| Method | Accuracy (typical large-scale benchmark) | Interpretability | Forensic deployment status |
|---|---|---|---|
| Burrows's Delta | 70-85% | High: distances between transparent feature vectors | Widely used in practice and case reports |
| SVM (linear kernel) | 80-92% | Moderate: feature weights are inspectable | Used in some forensic casework with feature-weight reporting |
| Random forest | 82-90% | Moderate: permutation importance scores available | Used in research; limited forensic case deployment |
| Neural network (LSTM/transformer) | 88-95% | Low: attention maps are approximate, not definitive | Research stage; very limited forensic use due to interpretability gap |
The admissibility debate is live in multiple jurisdictions. U.S. federal courts applying the Daubert standard consider, among other factors, whether a method is testable, has a known or potential error rate, has been peer-reviewed, and is generally accepted in the relevant scientific community. Delta satisfies these criteria better than most neural approaches. Courts in England and Wales, which have drawn on the South Australian decision R v Bonython when assessing expert evidence, similarly weigh method transparency heavily. The practical position most forensic linguists adopt is: use machine-learning methods to form an initial view, then confirm with a Delta-based method that can be explained transparently in the report and withstand cross-examination.
An accuracy claim without a validation protocol is not a result; it is a guess.
Stylometric methods can produce impressive numbers in controlled settings. The question is whether those numbers are honest estimates of performance on the kind of text the forensic case involves. Two practices undermine honest validation more than any others: over-fitting to a training corpus, and validating on data that overlaps with training data.
The closed-corpus trap is specifically the problem of validating a method only on texts from the same set of authors used to build the feature model. In that setting, the method has, in effect, seen the answer before the test. Reliable validation requires held-out authors, texts written at different times or in different registers, and ideally comparison against a population of texts from writers not in the candidate set, to establish the rate at which the method assigns a questioned text to the wrong person when the true author is absent.
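Both checks described above can be expressed as a small test harness. The sketch below is method-agnostic: it assumes the attribution method is wrapped in a callable, here called `attribute`, that takes a questioned text and a candidate set and returns an author name, or `None` if it declines to attribute. That interface and both function names are assumptions made for this example.

```python
def held_out_accuracy(texts_by_author, attribute):
    """Accuracy when each questioned text is removed from the reference set.

    texts_by_author: {author: [text, ...]}
    attribute: callable (questioned, {author: [text, ...]}) -> author or None
    """
    trials = correct = 0
    for author, texts in texts_by_author.items():
        for i, questioned in enumerate(texts):
            rest = texts[:i] + texts[i + 1:]
            if not rest:
                continue  # no reference material left for this author
            candidates = dict(texts_by_author)
            candidates[author] = rest
            trials += 1
            if attribute(questioned, candidates) == author:
                correct += 1
    return correct / trials


def absent_author_false_attribution_rate(texts_by_author, attribute):
    """Share of trials in which a candidate is named although the true
    author has been withheld from the candidate set entirely."""
    trials = false_hits = 0
    for held_out, texts in texts_by_author.items():
        remaining = {a: t for a, t in texts_by_author.items() if a != held_out}
        for questioned in texts:
            trials += 1
            if attribute(questioned, remaining) is not None:
                false_hits += 1
    return false_hits / trials
```

A method that always names its nearest candidate scores 1.0 on the second measure by construction, which is why some decision threshold or "none of the above" outcome is needed before an attribution can be reported with a meaningful error rate.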