Plagiarism and Text Reuse: Detection Methods and Evidence

How forensic linguists and computational tools identify verbatim copying, paraphrase, mosaic plagiarism, and translation plagiarism in academic and IP disputes, and what an expert witness adds beyond a similarity score.

Last updated: 19 Jun 2026

Forensic linguists identify text reuse by analysing the type, distinctiveness, and direction of similarity between documents, going well beyond the percentage score produced by detection software. Text reuse takes six main forms: verbatim copying, near-verbatim editing, mosaic (scattered borrowed phrases), paraphrase, structural reuse, and translation plagiarism, each requiring a different detection method and carrying different evidentiary weight. In legal proceedings, the applicable framework matters: academic plagiarism turns on unattributed use of ideas or expression, while copyright infringement requires substantial similarity of protected expression. A forensic expert's role is to interpret the raw similarity data, rule out common-source explanations, and present findings in a form a court can weigh.

A similarity percentage from detection software is not a verdict. A 35% match may reflect shared boilerplate and citations with no substantive copying; an 8% match may contain the entire core argument of the source, paraphrased sentence by sentence. The forensic linguist's task is to determine which is which, and that requires understanding the difference between surface string overlap and genuine textual reuse.

Text reuse comes in several forms that detection algorithms handle very differently. Verbatim copying is the easiest. Mosaic plagiarism, where borrowed phrases are scattered through new prose, is harder. Paraphrase of ideas, without any shared wording, is harder still. And translation plagiarism, where someone copies from a foreign-language source and renders it in a new language, defeats every character-level method entirely. Each form demands a different detection strategy.

This topic maps those detection strategies, explains what each can and cannot prove, and then asks the harder question: what does a forensic expert actually add beyond printing out a software report? The answer is interpretation, context, and the ability to explain a similarity score to a court in a way that is both honest and useful.

By the end of this topic you will be able to:

Distinguish the six categories of text reuse and explain why each requires a different detection strategy
Describe how fingerprinting, vector-space, and neural embedding methods differ in what similarity they can detect
Explain what cross-language plagiarism is and why standard detectors miss it
Contrast the legal standards for academic plagiarism and copyright infringement and calibrate expert analysis accordingly
Identify what a forensic linguist adds beyond software output when presenting similarity evidence in court

Key terms

Verbatim plagiarism: Copying text word-for-word from a source without attribution. The simplest form to detect computationally because character-level matching directly finds the copied string.
Mosaic plagiarism: Borrowing phrases and expressions from a source and embedding them across new sentences, so no single passage is a direct copy but the text is built substantially from another's language.
Paraphrase plagiarism: Reproducing the ideas and structure of a source while replacing most of the wording, making character-level detection ineffective. Requires semantic rather than string-based comparison.
Translation plagiarism: Copying from a source in another language and rendering it in the target language. Defeats monolingual detection entirely and requires cross-lingual semantic models.
Sentence embedding: A neural representation of a sentence as a dense numeric vector, trained so that semantically similar sentences land near each other in vector space regardless of wording. Used for semantic plagiarism and cross-language comparison.
Substantial similarity: The legal standard in copyright infringement, which asks whether the protected expression in one work is reproduced in another to a degree that a reasonable person would recognise the copying. Ideas alone are not protected, only their particular expression.

A taxonomy of text reuse

It helps to start with a clear map before going near detection algorithms, because each type of reuse has a different evidentiary weight and calls for a different expert response.

Reuse type	Key feature	Detection challenge	Evidential weight
Verbatim copying	Exact string match	Minimal; hash or n-gram comparison suffices	High: hard to explain innocently
Near-verbatim (minor edits)	Trivial word swaps, synonym substitution	Low; fuzzy matching or edit-distance handles it	High, with some debate over intent
Mosaic	Scattered borrowed phrases in new prose	Medium; short matches are common in any genre	Medium: depends on phrase uniqueness
Paraphrase	Ideas reproduced, wording changed	High; requires semantic analysis or expert reading	Variable: context and uniqueness matter
Structural	Same argument sequence and section organisation	High; no string evidence at all	Lower: structure alone rarely proves copying
Translation	Cross-language copy	Very high; requires multilingual models	High if the translation is close, hard to prove if loose

A key concept across all these categories is the distinctiveness of the shared material. Finding the phrase 'the data show' in two papers is trivial. Finding the same unusual three-sentence construction and the same idiosyncratic spelling error in both is not. Forensic linguists focus on what is distinctive, rare, and unlikely to arise independently, rather than on the total percentage of matched text.
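The difference between the first two rows of the table can be seen in a few lines of Python. This is a minimal sketch, not a production detector: the two sentences are invented, and `difflib.SequenceMatcher` from the standard library stands in for the fuzzy matching mentioned in the table (its ratio is related to, but not the same as, an edit distance).

```python
import hashlib
from difflib import SequenceMatcher


def exact_match(a, b):
    """True only if the two strings are byte-for-byte identical."""
    return hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest()


def word_similarity(a, b):
    """Similarity ratio in [0, 1], computed over word sequences."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


# Invented example sentences: two words swapped for synonyms.
source = "The results clearly demonstrate a significant increase in error rates across all groups."
edited = "The results clearly show a significant rise in error rates across all groups."

print("exact match:", exact_match(source, edited))
print("word similarity:", round(word_similarity(source, edited), 2))
```

The exact comparison reports no match, while the word-level ratio stays high. Neither number says anything about distinctiveness, which is why the ratio is a screening step rather than a finding.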

Computational detection: how the algorithms work

Detection methods fall broadly into three generations of increasing sophistication. Understanding the mechanics explains why a report from any one tool gives an incomplete picture.

First generation: fingerprinting
Rabin-Karp and related rolling-hash methods break both texts into overlapping windows of characters or words (n-grams), hash each window to a fingerprint, and store those fingerprints in an index. Matching fingerprints across documents identify shared passages. The method is fast and scales to enormous corpora, but only catches verbatim or near-verbatim copies. Changing a word breaks every window that spans it, so substitutions spaced closer than the window length leave no matching fingerprints at all.
Second generation: vector-space models
TF-IDF weighted bag-of-words vectors represent each document as a point in a high-dimensional space, and cosine similarity measures the angle between two points. Documents using the same vocabulary at similar frequencies cluster together. This catches paraphrase that keeps the vocabulary but changes the word order. It handles synonym substitution poorly, because a replaced word is simply a different dimension of the vector.
Third generation: neural sentence embeddings
Models like BERT, SBERT, and LaBSE encode sentences as dense semantic vectors. Two sentences meaning the same thing in different words land close together; two sentences using the same words with opposite meanings should, ideally, land further apart. Cross-lingual models like LaBSE align languages in the same space, making translation plagiarism detectable. These methods are the current best practice for semantic and cross-language similarity, but they require more computational resources and produce scores that are harder to explain to a lay audience.
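The gap between the first two generations is easy to demonstrate on toy data. The sketch below is illustrative only: it hashes word trigrams with Python's built-in `hash` rather than implementing a true Rabin-Karp rolling hash, it uses raw counts with a small hand-picked stopword list as a crude substitute for IDF weighting, and the three sentences are invented.

```python
import math
import re
from collections import Counter

# Crude stand-in for IDF weighting: drop a few very common function words.
STOPWORDS = {"the", "and", "of", "was", "when", "a", "in"}


def tokens(text):
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"[a-z']+", text.lower())


def fingerprints(text, n=3):
    """Hashes of overlapping word n-grams (simplified, not a true rolling hash)."""
    words = tokens(text)
    return {hash(tuple(words[i:i + n])) for i in range(len(words) - n + 1)}


def fingerprint_overlap(a, b, n=3):
    """Jaccard overlap between the two fingerprint sets."""
    fa, fb = fingerprints(a, n), fingerprints(b, n)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


def bow_cosine(a, b):
    """Cosine similarity of content-word count vectors, ignoring word order."""
    ca = Counter(w for w in tokens(a) if w not in STOPWORDS)
    cb = Counter(w for w in tokens(b) if w not in STOPWORDS)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


source = "The committee reviewed the thesis and found clear evidence of copying."
reordered = "Clear evidence of copying was found when the committee reviewed the thesis."
synonyms = "The panel examined the dissertation and discovered obvious proof of plagiarism."

for label, text in [("reordered", reordered), ("synonyms", synonyms)]:
    print(label, round(fingerprint_overlap(source, text), 2), round(bow_cosine(source, text), 2))
```

On these sentences the reordered version keeps only part of its fingerprint overlap but a cosine of essentially 1, while the synonym version scores zero on both. That last case is the one that calls for sentence embeddings or expert reading.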

Detection method capability across reuse types.

Cross-language plagiarism

Cross-language plagiarism is more common than most institutions realise. A researcher reads a Spanish paper, translates the argument into English, and submits it as original work. Because no English string in the output matches the Spanish source, standard detectors give a clean result. The same logic applies to Japanese-to-Chinese copying in scientific publishing, or French-to-Portuguese in academic theses.

The detection approach requires mapping both texts into a shared meaning space. Multilingual sentence encoders, trained on parallel translation corpora, assign similar vectors to sentences that express the same content regardless of language. Once both documents are embedded, cosine similarity across sentence pairs flags candidate matches, which a bilingual reviewer then verifies.

Machine-translating one document before running a monolingual comparison is a cheaper alternative, but translation errors lower detection precision.
Reference databases for cross-language detection are far smaller than English-only corpora, limiting recall in less-indexed language pairs.
Expert bilingual analysis is still required to distinguish coincidental conceptual overlap from genuine copying, particularly in technical fields where vocabulary is constrained.
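The embed-and-compare step described above can be sketched as follows. The encoder itself is out of scope here: the three-dimensional vectors are hand-written placeholders for what a multilingual sentence encoder would output, and the 0.8 threshold is an arbitrary assumption, not a validated cut-off.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def candidate_pairs(source_vecs, suspect_vecs, threshold=0.8):
    """Return (source_index, suspect_index, score) for pairs at or above the threshold."""
    hits = []
    for i, u in enumerate(source_vecs):
        for j, v in enumerate(suspect_vecs):
            score = cosine(u, v)
            if score >= threshold:
                hits.append((i, j, score))
    # Highest-scoring pairs first, so the reviewer sees the strongest candidates first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Placeholder vectors: in practice these would come from a multilingual encoder.
spanish_source = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3]]
english_suspect = [[0.2, 0.1, 0.9], [0.85, 0.2, 0.05]]

for i, j, score in candidate_pairs(spanish_source, english_suspect):
    print(f"source sentence {i} ~ suspect sentence {j}: {score:.2f}")
```

The output is a list of candidates for a bilingual reviewer, not a set of findings. The threshold trades recall against reviewer workload and would need to be justified for the language pair and genre in question.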

Academic plagiarism versus IP infringement

The factual question, 'did this person copy from this source?', looks similar in both academic and legal contexts. The standard of proof and the definition of actionable similarity are different in each, and a forensic linguist must calibrate their analysis accordingly.

Dimension	Academic plagiarism	IP infringement (copyright)
Forum	University disciplinary committee	Civil court; criminal in some jurisdictions
Standard of proof	Balance of probabilities (in practice often applied less strictly)	Civil: balance of probabilities; criminal: beyond reasonable doubt
What is protected	Ideas can constitute plagiarism if undisclosed	Ideas are not protected; only specific expression
Threshold for violation	Any unattributed use of another's work	Substantial similarity of protected expression
Who is harmed	Institution, original author, reader	Copyright holder (may differ from author)
Typical remedy	Grade penalty, degree revocation	Injunction, damages, account of profits

The practical implication: a passage paraphrased from a textbook may constitute academic plagiarism even though no copyrightable expression was reproduced. Conversely, a verbatim copy of a factual database entry may not infringe copyright if the entry lacks the originality that the applicable jurisdiction's law requires. An expert who applies the wrong framework to the wrong forum will confuse rather than assist the decision-maker.

What the forensic linguist adds beyond the software

Courts sometimes question why expert analysis is needed when a software report already shows a similarity figure. The percentage by itself carries little meaning without context. A 42% match may come entirely from a shared legal disclaimer and references section, with the body of the text showing no overlap. A 12% match may come from a single 300-word verbatim block containing the core original argument of the source. The second case is more serious, but the software scores it lower.

Identifying which matched material is distinctive versus common: boilerplate, citations, and genre formulas match across texts innocently.
Assessing directionality: which text came first, based on external evidence and internal features such as errors that were copied forward.
Evaluating the uniqueness of matching passages using corpus frequency data: a rare three-word phrase appearing in both texts has far more weight than a common one.
Ruling out independent creation from a common source: if both parties drew on the same publicly available government report, their texts may overlap without either copying the other.
Explaining what the similarity score means in plain terms that allow the court to weigh it proportionately.
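The corpus-frequency step in the list above can be illustrated with a small sketch. Everything here is hypothetical: the two documents, the three-document reference set, and the phrase "lexical drift coefficient" are invented, and the smoothed inverse document frequency is one plausible rarity score among several. A real analysis would draw on a large, genre-matched reference corpus.

```python
import math
import re
from collections import Counter


def ngrams(text, n=3):
    """Word n-grams from lowercased text, punctuation stripped."""
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def rank_shared_phrases(doc_a, doc_b, reference_docs, n=3):
    """Phrases shared by both documents, rarest in the reference set first."""
    shared = set(ngrams(doc_a, n)) & set(ngrams(doc_b, n))
    doc_freq = Counter()
    for ref in reference_docs:
        doc_freq.update(set(ngrams(ref, n)))  # count each phrase once per document
    total = len(reference_docs)
    scored = [(" ".join(g), math.log((total + 1) / (doc_freq[g] + 1))) for g in shared]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


doc_a = "The data show that the lexical drift coefficient rises sharply."
doc_b = "As the data show that trend, the lexical drift coefficient was stable."
reference = [
    "The data show that results vary.",
    "Here the data show that nothing changed.",
    "Our data show that effects persist.",
]

for phrase, rarity in rank_shared_phrases(doc_a, doc_b, reference):
    print(f"{rarity:.2f}  {phrase}")
```

The common formula "data show that" sinks to the bottom with a score of zero, while the phrases absent from the reference set rise to the top. A ranking like this tells the analyst where to read closely; it does not by itself establish copying.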

Forensic linguist workflow: from software score to expert opinion.

Presenting similarity evidence in court

The challenge of explaining a similarity score to a lay audience is not merely pedagogical. Courts have excluded or heavily discounted expert testimony that rested on software output without adequate explanation of the method's assumptions, its validated accuracy, and the gap between the raw score and the conclusion.

Under Daubert and Kumho Tire in the United States, a court may ask whether the detection method has a known error rate, has been subjected to peer review, and is generally accepted in the relevant scientific community. Neural sentence embedding for cross-language plagiarism is relatively recent and its validation for forensic purposes is still developing, which an expert must disclose. Older fingerprinting methods have far more documented use and published accuracy data.

Disclose the tool, its version, and the comparison corpus used.
Separate matched passages into meaningful categories: distinctive expression, common genre language, and citations.
Present findings visually where possible: a side-by-side comparison of key passages is often more persuasive and more honest than a number.
State the limits of the analysis explicitly: what the similarity shows, and what it cannot rule out.

Worked example

A PhD thesis and the original journal article

Twelve percent match, one serious problem.

A university research integrity committee refers a doctoral thesis for forensic linguistic analysis after an anonymous complaint. The initial Turnitin report shows 12% overall similarity, which the student argues is unremarkable for a 90,000-word thesis in a technical field with constrained vocabulary. The committee asks whether the 12% warrants further investigation.

The software report is filtered to remove reference-section matches (which account for 7 of the 12 percentage points) and standard methodological phrases that appear in dozens of papers in the field (which account for most of the remaining 5 points).
What is left is a single 400-word block in the literature review chapter, under half a percent of the 90,000-word thesis. The block is compared manually against the flagged source, a 2018 journal article.
The 400-word block is near-verbatim in some sentences and lightly paraphrased in others. Crucially, it contains two idiosyncratic grammatical constructions that appear in the source article but are unusual in the student's own writing elsewhere in the thesis.
The direction of copying is assessed. The journal article was published in 2018; the thesis was submitted in 2021. No plausible scenario is identified under which the journal article could have copied from the unpublished thesis.
The conclusion is that the 400-word block shows strong evidence of direct copying or close mosaic reuse, even though the overall similarity percentage was low. The committee proceeds to a full hearing.

The case illustrates that a headline similarity percentage can obscure the most significant finding. Reading past the number to the pattern, location, and distinctiveness of the matches is what transforms a software report into usable evidence.
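For a sense of scale, the figures in this example can be converted to word counts. The calculation below assumes, as a simplification, that the similarity percentage is a straightforward share of the thesis word count; real tools may count differently.

```python
thesis_words = 90_000
headline_pct = 12        # overall similarity reported by the software
reference_points = 7     # percentage points attributable to the reference section
block_words = 400        # the block examined in the literature review

matched_words = thesis_words * headline_pct / 100
reference_words = thesis_words * reference_points / 100
block_pct = 100 * block_words / thesis_words

print(f"words behind the headline figure: {matched_words:.0f}")
print(f"of which reference-section matches: {reference_words:.0f}")
print(f"the 400-word block as a share of the thesis: {block_pct:.2f}%")
```

On that assumption the decisive passage is less than half a percentage point of the thesis, far too small to stand out in a headline figure.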

Check your understanding

Why does Rabin-Karp fingerprinting fail to detect paraphrase plagiarism?

Key Takeaways

Text reuse falls into verbatim, near-verbatim, mosaic, paraphrase, structural, and translation categories, each requiring a different detection method and carrying different evidentiary weight.
Fingerprinting algorithms like Rabin-Karp handle verbatim copying efficiently; neural sentence embeddings are needed for paraphrase and cross-language detection.
A software similarity percentage is raw data; a forensic linguist's value lies in distinguishing distinctive copied expression from genre boilerplate, assessing directionality, and ruling out common-source explanations.
Academic plagiarism and copyright infringement use the same factual investigation but different legal standards: infringement requires substantial similarity of protected expression, not just unattributed use of ideas.
Expert reports must disclose the tool, corpus, method, and limitations, and should present key matches in side-by-side form so the court can assess them without relying solely on the expert's conclusion.

What is mosaic plagiarism?

Mosaic plagiarism involves taking words and phrases from a source and weaving them into new sentences without quotation marks, so no single passage is copied verbatim but the borrowed material is spread throughout. Software detectors often miss it because each individual match is short and hard to tell apart from phrasing that is common in the genre, which is why human analysis matters.

Can plagiarism detection software replace a forensic linguist in court?

No. Software outputs a similarity percentage, but it cannot say whether the similarity is coincidental, explainable by genre conventions, or the result of copying. A forensic linguist analyses the pattern and direction of the similarity and the uniqueness of matching passages. That interpretive work is what courts need from an expert.

How does cross-language plagiarism work and why is it hard to detect?

Cross-language plagiarism copies ideas or text from a source in one language and renders them in another, defeating character-level matching. Detection requires multilingual sentence embeddings that map both texts into a shared semantic space and measure conceptual proximity rather than surface string overlap.

What is the difference between academic plagiarism and IP infringement?

Academic plagiarism is an institutional integrity matter where any unattributed borrowing can be actionable, including reuse of ideas. Copyright infringement requires substantial similarity of protected expression; ideas themselves are not protected. The standard of proof and the definition of harm differ between the two frameworks.

What is Rabin-Karp fingerprinting in plagiarism detection?

Rabin-Karp is a rolling-hash algorithm that converts overlapping windows of text into numeric fingerprints and identifies matching passages across documents. It is fast and accurate for verbatim copying but cannot detect paraphrase, because changed wording produces different hash values.
