Phylogenetics for Species Identification and Lineage Reconstruction

How a forensic biotechnologist places an unknown sequence on a tree: neighbour-joining, maximum likelihood and Bayesian phylogenetic methods, bootstrap support, the cytochrome-b and COI barcoding standards, and the wildlife-forensic and microbial-attribution cases that turned phylogenetics into court-admissible evidence.

Last updated: 18 Jun 2026

A phylogenetic tree is a hypothesis about evolutionary relationships expressed as a branching diagram. When a forensic scientist builds one from evidentiary sequences, the tree becomes something more specific: a geometrical argument that places an unknown sequence closer to one species, population, or individual than to any other. Courts on three continents have accepted this argument in diverse contexts. In the 1992 Florida HIV dental-transmission case (Ou et al., Science, 1992), phylogenetic analysis demonstrated that the viral sequences in five patients clustered with the sequences from their HIV-positive dentist rather than with local community controls, making transmission the most parsimonious explanation. In the 2007 Libyan HIV trial, phylogenetics was used to show that the strain infecting Benghazi children predated the arrival of the medical personnel accused of deliberate infection. In the Amerithrax investigation following the 2001 anthrax letter attacks, whole-genome phylogenetics of Bacillus anthracis Ames strain narrowed the source to a specific flask in a specific US Army laboratory.

Key takeaways

Three methods are used in court: neighbour-joining (screening only), maximum likelihood (RAxML-NG, IQ-TREE), and Bayesian MCMC (MrBayes, BEAST); bootstrap support above 95% or posterior probability above 0.95 is required for strong forensic claims.
Community controls are mandatory in viral transmission cases; their exclusion from the evidence-source clade is the statistical test that distinguishes direct linkage from shared regional viral diversity.
BEAST adds a molecular clock to Bayesian phylogenetics, estimating when lineages diverged; this was used in the 2007 Libyan HIV trial to show the Benghazi strain predated the accused medical workers by years.
Phylogenetic proximity establishes biological closeness but cannot determine the direction of transmission; non-molecular evidence of access and opportunity is required to convert source attribution into individual criminal attribution.
ENFSI best-practice guidelines (2016/2021) require six documentation elements in the case file: alignment software and parameters, substitution model, outgroup sequence, bootstrap or MCMC chain length, software version with random seed, and GenBank accession numbers for all reference sequences.

These cases differ in scale and in the organisms involved, but they share a common computational structure: align the sequences, estimate a tree, measure how robustly each branch is supported, and report whether the unknown groups closer to the suspect source than to everything else. The tools that perform these operations, MEGA, MrBayes, RAxML, IQ-TREE, and BEAST, are standard in academic evolutionary biology but require forensic-specific calibration before their output can reach a courtroom.

This topic explains the three main phylogenetic methods, the statistics that measure their reliability, and the evidentiary standards three jurisdictions apply when a tree is offered as expert evidence.

Neighbour-Joining: Distance-Based Trees and Their Forensic Limits

The simplest phylogenetic method produces a tree in seconds, and the same simplicity that makes it useful for rapid screening is what makes it inadmissible as a standalone forensic conclusion.

Neighbour-joining (NJ), developed by Saitou and Nei in 1987, constructs a tree by iteratively pairing the two sequences (or operational taxonomic units, OTUs) that minimise the total branch length at each step. The algorithm operates on a pairwise distance matrix calculated from a multiple sequence alignment, where distances are typically corrected for multiple substitutions at the same site using models such as Kimura 2-parameter (K2P) or Jukes-Cantor (JC). K2P, which treats transitions (purine-to-purine or pyrimidine-to-pyrimidine substitutions) and transversions (purine-to-pyrimidine substitutions or the reverse) separately, is the BOLD Systems default for COI barcode species identification.
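The K2P correction described above can be sketched in a few lines. This is a minimal illustration, assuming two already-aligned sequences of equal length; the function name `k2p_distance` is hypothetical, and sites containing gaps or ambiguity codes are simply skipped, which is one convention among several.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance between two aligned sequences.

    P = proportion of sites showing a transition,
    Q = proportion of sites showing a transversion,
    d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes (an assumption of this sketch)
        compared += 1
        if a == b:
            continue
        # A change within the purines or within the pyrimidines is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent (saturated).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

Applying the function to every pair of sequences in an alignment would give the distance matrix that NJ takes as input.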

NJ is computationally inexpensive enough to run on a laptop for hundreds of sequences in under a minute. The BOLD Systems barcode engine uses NJ trees as a first-pass species identification method for its COI reference library, and many wildlife forensic laboratories use NJ as a preliminary screen before committing to a full maximum-likelihood analysis. In Germany, the Federal Criminal Police Office (BKA) laboratory uses NJ trees for preliminary species calls in ivory and timber casework, with confirmatory maximum-likelihood analysis performed before court reporting.

The forensic limitation of NJ is that distance-matrix methods lose the character-state information (the actual nucleotide at each position) that model-based methods use. NJ does not test competing topologies, it does not provide a statistical model that can be independently validated, and it does not quantify uncertainty in branch placement in the same probabilistic sense that Bayesian methods do. In most jurisdictions, NJ is acceptable as a screening tool but not as the sole phylogenetic evidence in a contested case.

Bootstrap analysis, applied to NJ trees, involves resampling the alignment columns with replacement to generate 100 to 1,000 replicate datasets, constructing an NJ tree from each, and recording the percentage of replicates in which a given clade appears. Bootstrap support above 70% is conventionally regarded as moderate, and above 95% as strong. The SWGWILD guidelines recommend reporting bootstrap values in published forensic phylogenetic analyses regardless of the tree-building method used.
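The resampling step can be illustrated with a short sketch. To stay self-contained it does not build an NJ tree for each replicate; instead it uses a simplified proxy, counting how often the unknown sequence is strictly closer (by uncorrected p-distance) to one candidate reference than to every other reference. All function names and the alignment layout (a dict of name to aligned sequence) are assumptions made for illustration.

```python
import random


def resample_columns(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Draw alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bootstrap_support(alignment, unknown, candidate, others, replicates=1000, seed=1):
    """Percentage of replicates in which `unknown` is nearest to `candidate`.

    A stand-in for clade recovery: a real analysis would rebuild a tree
    per replicate and check whether the clade of interest appears.
    """
    rng = random.Random(seed)  # fixed seed so the run can be reproduced
    hits = 0
    for _ in range(replicates):
        rep = resample_columns(alignment, rng)
        d_candidate = p_distance(rep[unknown], rep[candidate])
        if all(d_candidate < p_distance(rep[unknown], rep[o]) for o in others):
            hits += 1
    return 100.0 * hits / replicates


def support_label(percent: float) -> str:
    """Conventional reading of bootstrap values used in this topic."""
    if percent > 95:
        return "strong"
    if percent >= 70:
        return "moderate"
    return "weak"
```

Fixing the seed matters for casework: the same alignment and seed reproduce the same replicates, which is what lets an opposing expert rerun the analysis.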

Maximum Likelihood and Bayesian Methods: The Court-Grade Approaches

The two model-based phylogenetic methods frame their answers in the language of probability, and that probabilistic framing is exactly what forensic reporting under the likelihood-ratio framework requires.

Maximum likelihood (ML) phylogenetics, as implemented in RAxML-NG, IQ-TREE, and the PhyML engine, evaluates tree topology and branch lengths by finding the parameter values that maximise the probability of observing the data under a specified nucleotide substitution model. The substitution model (GTR+G+I is the most common for forensic mitochondrial data, where GTR is general time reversible, G accounts for among-site rate variation via a gamma distribution, and I allows for a proportion of invariable sites) is selected by model-testing procedures implemented in tools such as ModelTest-NG or IQ-TREE's built-in ModelFinder. Selecting the best-fit model and documenting the selection procedure is a reproducibility requirement: the 2016 updated ENFSI best practice guidelines for forensic molecular genetic analysis recommend documenting the substitution model and model-selection procedure in the case file.
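Model selection of the kind described above usually compares candidate models with an information criterion that rewards likelihood and penalises extra parameters. The sketch below shows the arithmetic for BIC and AIC; the log-likelihoods and parameter counts are invented for illustration and are not output from ModelTest-NG or ModelFinder.

```python
import math


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k * ln(n) - 2 * lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 * lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def best_model(candidates: dict[str, tuple[float, int]], n_sites: int) -> str:
    """Return the candidate name with the lowest BIC.

    candidates maps model name -> (log-likelihood, number of free parameters).
    """
    return min(candidates, key=lambda m: bic(candidates[m][0], candidates[m][1], n_sites))


# Hypothetical values for a 27-taxon, 450-site alignment.
# An unrooted tree of 27 taxa has 2 * 27 - 3 = 51 branch lengths; each model
# adds its own substitution and rate-heterogeneity parameters on top.
BRANCHES = 51
example_candidates = {
    "JC": (-3200.0, BRANCHES + 0),
    "HKY+G": (-3100.0, BRANCHES + 4 + 1),
    "GTR+G+I": (-3095.0, BRANCHES + 8 + 1 + 1),
}
selected = best_model(example_candidates, n_sites=450)
```

Recording the criterion used, the candidate set, and the scores alongside the chosen model is what makes the selection step auditable in a case file.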

Bayesian phylogenetics, as implemented in MrBayes and BEAST, treats tree topology and model parameters as random variables and estimates their joint posterior probability distribution using Markov chain Monte Carlo (MCMC) sampling. The output is a posterior probability (PP) for each node: the proportion of MCMC samples in which that node appeared. A PP of 0.95 means that the clade was present in 95% of the sampled trees. This is a probability statement conditional on the data, the substitution model, and the priors; it is not interchangeable with a frequentist 95% confidence level or with a 95% bootstrap value, and posterior probabilities often run higher than bootstrap percentages for the same node. This probabilistic output maps naturally onto the forensic likelihood-ratio framework favoured by ENFSI, ISFG, and SWGDAM.
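The node posterior probability is, mechanically, a frequency count over sampled trees. The sketch below assumes each sampled tree has already been reduced to the set of clades it contains (each clade a frozenset of tip names) and that an initial fraction of samples is discarded as burn-in; the representation, the function name, and the 25% default are assumptions of this example rather than the behaviour of any particular package.

```python
def clade_posterior(sampled_trees, clade, burn_in_fraction=0.25):
    """Proportion of post-burn-in MCMC samples that contain `clade`.

    sampled_trees: list of sets of frozensets, one set of clades per sampled tree.
    clade: iterable of tip names defining the clade of interest.
    """
    start = int(len(sampled_trees) * burn_in_fraction)
    kept = sampled_trees[start:]  # discard early samples taken before convergence
    if not kept:
        raise ValueError("no samples left after burn-in")
    target = frozenset(clade)
    return sum(target in tree for tree in kept) / len(kept)


# Toy chain of eight samples; names are hypothetical.
with_clade = {frozenset({"patient", "source"}), frozenset({"ctrl_1", "ctrl_2"})}
without_clade = {frozenset({"patient", "ctrl_1"}), frozenset({"source", "ctrl_2"})}
chain = [without_clade, without_clade,  # removed as burn-in
         with_clade, with_clade, with_clade, without_clade, with_clade, with_clade]
pp = clade_posterior(chain, {"patient", "source"})
```

In this toy chain the first two samples are dropped and the clade appears in five of the remaining six, so the estimate depends on how much burn-in is discarded, which is one reason chain length and burn-in belong in the report.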

The Florida HIV case (1992, prosecuted under Florida Statute 384.24) was one of the first criminal cases to use phylogenetic evidence. The prosecution's phylogenetic analysis, performed on HIV-1 env gene sequences with the parsimony and distance-based tree-building methods available at the time (Bayesian phylogenetic software did not yet exist), showed that the five patients' viral sequences were nested within the dentist's viral sequences in every well-supported tree. The analysis was replicated by independent groups, and the replication confirmed the branching structure. The case established several forensic phylogenetics principles that the NIAID and CDC subsequently formalised: sequences from community controls must be included; the analysis must be performed by independent analysts; the tree must be displayed with branch lengths; and the bootstrap or posterior probability values must be reported.

In the 2007 Libyan HIV trial, phylogenetic analysis conducted by Luc Montagnier and Vittoria Colizzi showed that the Benghazi children's viral strains had been circulating in Libya since the mid-1990s, predating the arrival of the accused medical personnel by years. The dating rested on a BEAST-based molecular-clock analysis. The phylogenetic evidence was central to the eventual release of the accused and formed the basis of the European Parliament's report on the trial.

Simplified phylogenetic tree showing forensic placement of an unknown sequence (U) within a reference population. Bootstrap values above node branches indicate support for each clade. The unknown clusters with Reference Group A (bootstrap 97%), not Reference Group B (bootstrap 88%), supporting attribution to Group A.

Method selection for forensic phylogenetics: screening only warrants neighbour-joining; court-reportable attribution requires ML or Bayesian MCMC; transmission timing (Libyan HIV trial) requires BEAST with molecular clock.

COI Barcoding and Cytochrome-b: The Standard Forensic Markers

Two mitochondrial genes have been provisionally declared the universal forensic species-identification markers, but the declaration comes with caveats that every wildlife case file should address.

The cytochrome c oxidase subunit I (COI) gene, a 648-bp region amplifiable with the universal LCO1490/HCO2198 primer pair, was proposed by Paul Hebert and colleagues (University of Guelph, Canada) in 2003 as the universal DNA barcode for animal species identification. The Barcode of Life Data Systems (BOLD), also maintained at Guelph, holds the reference library for COI-based identification across hundreds of thousands of species. BOLD's identification engine uses NJ trees against its curated library and reports a percent similarity to the nearest named species. For vertebrates, a COI sequence above 97-98% similarity to a single species, with at least 2% divergence from the next-closest species, is the conventional forensic threshold for species identification.
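The two-part threshold above (high similarity to one species, clear separation from the next) can be written as a simple decision rule. The sketch assumes the analyst already has the best percent similarity per candidate species, reads "divergence from the next-closest species" as 100 minus the query's similarity to that species, and uses 98% and 2% as defaults; the function name and input layout are hypothetical.

```python
def coi_species_call(best_similarity_by_species, min_similarity=98.0, min_gap=2.0):
    """Return a species name if the conventional thresholds are met, else None.

    best_similarity_by_species: species name -> best percent similarity to the query.
    """
    ranked = sorted(best_similarity_by_species.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None

    top_species, top_similarity = ranked[0]
    if top_similarity < min_similarity:
        return None  # no reference is close enough to name a species

    if len(ranked) > 1:
        divergence_from_next = 100.0 - ranked[1][1]
        if divergence_from_next < min_gap:
            return None  # a second species is too close to rule out

    return top_species
```

Returning None rather than a best guess mirrors casework practice: a query that fails either test goes forward to phylogenetic confirmation instead of being reported as a species call.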

Cytochrome-b (cyt-b), a 359 to 1140-bp region depending on the primer pair, is the traditional forensic vertebrate identification marker that preceded the formal barcoding movement. It has a larger body of forensic validation literature, particularly for wildlife casework. The USFWS National Fish and Wildlife Forensics Laboratory (NFWFL) in Ashland, Oregon, the world's only accredited laboratory dedicated entirely to wildlife forensics, uses cyt-b as its primary species identification marker for mammalian casework, backed by an internal validated reference library. In the European Union, the TRACES network, coordinated through national wildlife forensics laboratories, uses cyt-b alongside COI for CITES species identification.

In India, the Wildlife Institute of India's Forensic Laboratory (WII, Dehradun) uses cyt-b and COI for tiger, leopard, elephant, and rhinoceros casework under the Wildlife Protection Act 1972. Published studies from WII have demonstrated differentiation of all five Indian big cats (lion, tiger, leopard, snow leopard, clouded leopard) using cyt-b sequences, with greater than 98% pairwise identity within species and greater than 5% divergence between species. The same methodology has been used in ivory-trafficking prosecutions under CITES Appendix I.

Step 1 (Sample collection): Tissue, feather, scale, or ivory biopsy collected under chain-of-custody protocols. Seizure evidence documented with CITES-standard paperwork.

Step 2 (DNA extraction): Silica-column or magnetic-bead extraction. For hard tissues (bone, ivory), decalcification and overnight lysis at 56°C before column binding.

Step 3 (PCR amplification): Universal COI primers (LCO1490/HCO2198) or cyt-b primers amplify the target. Negative extraction and PCR controls run in parallel.

Step 4 (Sanger sequencing): Bidirectional capillary electrophoresis sequencing on ABI 3500 or equivalent. Both strands must be sequenced for forensic-grade accuracy.

Step 5 (BLAST query + BOLD ID): Query NCBI GenBank via BLAST and BOLD Systems separately. Cross-validate both results. Note E-value, percent identity, and number of reference sequences for the top hit.

Step 6 (Phylogenetic confirmation): For contested cases or low-confidence BLAST/BOLD matches, NJ + ML phylogenetic analysis with reference sequences downloaded from GenBank. Bootstrap support > 95% required for court reporting.

Microbial and Viral Forensic Phylogenetics: Attribution at the Strain Level

Pathogen phylogenetics entered the courtroom before most forensic biologists had heard of the technique, and the standards it established for replication and independent expert review became the template for every case that followed.

Microbial forensics is the application of molecular phylogenetics to attribute a pathogen to its source, whether that source is an individual (in transmission cases), a facility (in biosecurity attribution), or a geographic reservoir (in outbreak investigation). The discipline draws on the same tree-building tools as wildlife forensics but applies them in a context where the potential for deliberate tampering with sequences and the life-or-death stakes for the accused make independent replication non-negotiable.

The Amerithrax investigation (FBI, 2001-2008) is the paradigm case. After the anthrax letter attacks killed five people and infected 17 others across the United States in October 2001, investigators used whole-genome comparative phylogenetics of Bacillus anthracis Ames strain. Four morphotype variants in the spore preparation were found to match variants in a single flask (RMR-1029) maintained at Fort Detrick, Maryland, US. The phylogenetic analysis used SNP-based maximum-parsimony trees and was subjected to independent review by the National Biodefense Analysis and Countermeasures Center (NBACC). The attribution pointed to one individual, who died before trial. The methodology, published in Science in 2011 by Keim and colleagues, became a template for microbial forensic standards published by the American Academy of Microbiology.

SARS-CoV-2 phylogenetics during the COVID-19 pandemic (2020 onwards) demonstrated the same framework at global scale. The GISAID (Global Initiative on Sharing All Influenza Data) database, hosted in Munich with mirror nodes worldwide, provided a real-time repository of SARS-CoV-2 genome sequences from laboratories in more than 180 countries. Nextstrain, an open-source phylodynamic toolkit (its Augur pipeline builds maximum-likelihood trees and dates them with TreeTime rather than running BEAST), visualised variant emergence and transmission chains, enabling public health genomic surveillance at a speed never previously achieved. The SARS-CoV-2 case established that global pathogen phylogenetics is now an operational capability, not just a research tool, with direct implications for biosecurity attribution in future outbreak investigations.

In the UK, the Health Security Agency (UKHSA) runs a National Influenza Pandemic Preparedness Plan that incorporates phylodynamic surveillance. In India, INSACOG (Indian SARS-CoV-2 Genomics Consortium), coordinated through the Biological and Health Sciences division of the Department of Biotechnology, sequenced over 100,000 SARS-CoV-2 genomes between 2021 and 2022, demonstrating the institutional capacity for large-scale pathogen phylogenetics in the Indian system.

Tools and Reproducibility: MEGA, MrBayes, RAxML, IQ-TREE and BEAST

A phylogenetic tree is only as reproducible as the parameter choices that built it, and reproducibility is the first thing an opposing expert examines when forensic phylogenetics is challenged in court.

Five software packages dominate forensic phylogenetic analysis. MEGA (Molecular Evolutionary Genetics Analysis, developed at Arizona State University, US, and Temple University, US) is the most accessible entry point, providing NJ and ML tree construction with a graphical interface. It is the most cited phylogenetics package in the scientific literature and is appropriate for preliminary analyses, but it is not designed for the MCMC sampling that Bayesian analysis requires. MrBayes (developed at the University of Rochester, US, and the Swedish Museum of Natural History) is the standard Bayesian phylogenetics engine for forensic applications requiring posterior probability estimates. RAxML-NG (Heidelberg, Germany) and IQ-TREE (developed with contributions from Austrian and Australian groups) are the standard ML engines for large datasets. BEAST (Bayesian Evolutionary Analysis Sampling Trees, developed at the University of Auckland, New Zealand, and the University of Edinburgh, UK) performs Bayesian analysis with a molecular clock, enabling estimation of divergence times, which is forensically relevant in transmission timing analysis.

Reproducibility in forensic phylogenetics requires documenting: (1) the alignment software and parameters used; (2) the substitution model and model-selection procedure; (3) the outgroup sequence (the root anchor that determines the direction of the tree); (4) the bootstrap replication count or MCMC chain length; (5) the software version and random seed used; and (6) the GenBank accession numbers of all reference sequences included. ENFSI's best-practice guidelines for forensic molecular genetics (2016, updated 2021) require all six for any phylogenetic analysis submitted in expert evidence. The FBI's Forensic Science Communications and the American Academy of Microbiology's 2012 report on microbial forensics carry equivalent requirements.
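One way to make the six elements hard to forget is to capture them in a structured record that can be checked before a report is signed. The sketch below is an illustrative data structure only; the class name, field names, and the example values (including the version strings, outgroup label, and placeholder accession numbers) are hypothetical and do not come from any guideline.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PhyloRunRecord:
    """The six documentation elements listed above, one field each."""
    alignment_software_and_params: str = ""
    substitution_model_and_selection: str = ""
    outgroup: str = ""
    support_replicates_or_chain_length: str = ""
    software_version_and_seed: str = ""
    reference_accessions: list[str] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Names of any elements left empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical, partly completed record for a draft report.
draft = PhyloRunRecord(
    alignment_software_and_params="MUSCLE, default parameters",
    substitution_model_and_selection="HKY+G4, chosen by BIC",
    support_replicates_or_chain_length="1,000 bootstrap replicates",
)
gaps = draft.missing_elements()

# The same record once the remaining elements are filled in.
complete = PhyloRunRecord(
    alignment_software_and_params="MUSCLE, default parameters",
    substitution_model_and_selection="HKY+G4, chosen by BIC",
    outgroup="outgroup_sequence_label",           # placeholder label
    support_replicates_or_chain_length="1,000 bootstrap replicates",
    software_version_and_seed="version X.Y.Z, seed 12345",  # placeholder values
    reference_accessions=["XX000001", "XX000002"],           # placeholder accessions
)
```

A laboratory could run `missing_elements()` as a gate in its review workflow, refusing to release a tree while the list is non-empty.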

Software	Method	Output metric	Typical forensic use
MEGA	NJ, ML	Bootstrap %	Preliminary species ID, educational use
RAxML-NG	Maximum likelihood	Bootstrap %	Large datasets, rapid court timelines
IQ-TREE	ML + ModelFinder	Bootstrap % (UFBoot)	ML trees with model selection built in
MrBayes	Bayesian MCMC	Posterior probability	Human transmission cases, microbial attribution
BEAST	Bayesian MCMC + clock	PP + divergence dates	Transmission timing, outbreak dating

Worked example

HIV Transmission Case, Using Phylogenetics to Support Prosecution in a Criminal Exposure Case

A healthcare worker is charged with intentionally infecting a patient with HIV. The patient's virus must be shown to have come from the worker, not from community exposure. How does the phylogenetic tree deliver that evidence?

Scene: A criminal prosecution in a US state court (modelled on real HIV transmission cases). A healthcare worker is accused of deliberately exposing a patient to HIV. The patient tests HIV-positive 6 weeks after an injection administered by the worker. The defence argues the patient acquired HIV from community exposure.

Step 1 (Sample collection and sequencing): The HIV reverse transcriptase and envelope gene regions are sequenced from the patient, the accused healthcare worker, and 25 community controls (HIV-positive individuals in the same metropolitan area with no known link to the case).

Step 2 (Tree construction): IQ-TREE is used with ModelFinder (selecting HKY+G4 as the substitution model) to build a maximum-likelihood tree from a MUSCLE-aligned 450 bp envelope gene amplicon dataset. Bootstrap support is calculated from 1,000 ultrafast bootstrap replicates.

Step 3 (Interpretation): The patient's sequences form a monophyletic clade with the healthcare worker's sequences, with 94% bootstrap support. All 25 community controls fall in phylogenetically distinct positions outside this clade. This clustering is forensically significant because the exclusion of every community control from the clade shows that the patient-worker grouping is not an artefact of shared community viral diversity.

Step 4 (Temporal analysis with BEAST): BEAST is run with a relaxed molecular clock and collection dates as calibration. The estimated divergence date of the patient-worker clade is consistent with transmission around the time of the alleged injection event, not years earlier as the defence suggested.

Conclusion: Phylogenetic proximity (a clade with 94% bootstrap support containing only patient and worker sequences, with community controls excluded) plus temporal divergence estimation consistent with the alleged event date provide the two-part forensic conclusion: the patient's virus is closely related to the worker's virus and diverged at a time consistent with the alleged exposure. Because 94% falls just below the 95% threshold for strong support, the expert report should state the value and acknowledge it rather than present the tree as definitive. The result does not prove transmission mechanism or direction, but it excludes community-source hypotheses.
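The interpretation in Step 3 rests on one structural test: does the tree contain a clade that holds every case sequence and no community control? The sketch below checks that on a rooted tree written as nested tuples of tip names. The representation, function names, and tip labels are assumptions for illustration; a real analysis would parse the Newick file produced by the tree-building software and would also read off the support value for the clade.

```python
def leaf_sets(tree):
    """Return (tips under this node, list of tip sets for every internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []  # a tip has no internal clades
    tips = frozenset()
    clades = []
    for child in tree:
        child_tips, child_clades = leaf_sets(child)
        tips |= child_tips
        clades.extend(child_clades)
    clades.append(tips)  # the clade defined by this internal node
    return tips, clades


def exclusive_clade(tree, case_labels, control_labels):
    """True if some clade contains all case tips and none of the controls."""
    case = frozenset(case_labels)
    controls = frozenset(control_labels)
    _, clades = leaf_sets(tree)
    return any(case <= clade and not (clade & controls) for clade in clades)


# Hypothetical rooted trees with three controls instead of 25, for brevity.
linked = ((("patient_a", "worker_a"), "patient_b"), ("ctrl_1", ("ctrl_2", "ctrl_3")))
unlinked = (("patient_a", "ctrl_1"), ("worker_a", ("ctrl_2", "ctrl_3")))

case_tips = ["patient_a", "worker_a"]
control_tips = ["ctrl_1", "ctrl_2", "ctrl_3"]
linked_result = exclusive_clade(linked, case_tips, control_tips)
unlinked_result = exclusive_clade(unlinked, case_tips, control_tips)
```

The check depends on rooting, which is why the outgroup appears among the documentation requirements: on a differently rooted tree the same tips can define different clades.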

Frequently asked questions

What does bootstrap support mean in a forensic phylogenetic tree?

Bootstrap support measures how reproducibly a particular clade appears when alignment data are resampled randomly with replacement (typically 1,000 replicates) and a tree is rebuilt for each resample. A value of 94% means 94% of resampled trees recovered that grouping. Values above 95% indicate strong support; 70-95% moderate; below 70% weak. ENFSI and the American Academy of Microbiology require bootstrap values to be reported, with moderate-to-weak support acknowledged in the expert report rather than presenting the tree as definitive.

Can phylogenetics prove the direction of HIV transmission between two individuals?

No. Phylogenetics establishes that two viral sequences are closely related, consistent with a transmission event, but cannot determine direction. In a criminal prosecution, other evidence (who knew their HIV status, who had access to the other's blood) must establish direction. The ISFG and American Academy of Microbiology both warn that phylogenetic proximity is consistent with direct transmission, a shared source, or a chain through unsampled intermediates. Framing phylogenetics as proving transmission direction is a common expert witness error that has been challenged in multiple jurisdictions. The [microbial forensics](/topics/forensic-biotechnology/microbial-forensics-anthrax-letters-and-biothreat-attribution) topic covers how this principle was applied in the 1992 Florida HIV dental case.

What is BEAST and when is it used in forensic cases?

BEAST (Bayesian Evolutionary Analysis Sampling Trees) performs Bayesian phylogenetic analysis with a molecular clock. It incorporates sample collection dates as calibration points and estimates when viral lineages diverged from a common ancestor, producing a probability distribution over divergence times. This was the key tool in the 2007 Libyan HIV trial, where BEAST analysis showed the Benghazi children's strain had been circulating since the mid-1990s, years before the accused arrived. BEAST requires more parameter choices (clock model, tree prior) than ML tools like IQ-TREE, and its posterior probability outputs require careful expert interpretation.
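The idea of using collection dates as calibration can be shown with a much simpler stand-in than BEAST: a root-to-tip regression under a strict clock. Genetic distance from the root is regressed on sampling date; the slope estimates the substitution rate and the date at which the fitted line crosses zero distance estimates the age of the root. This is not BEAST's Bayesian relaxed-clock model and gives no posterior distribution; the function name and the data are hypothetical.

```python
def root_to_tip_clock(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (rate in substitutions/site/year, estimated date of the root).
    """
    n = len(dates)
    if n < 2 or n != len(distances):
        raise ValueError("need at least two (date, distance) pairs of equal length")

    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("all samples share one date; the clock cannot be calibrated")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))

    rate = sxy / sxx
    if rate <= 0:
        raise ValueError("no positive temporal signal in these data")

    intercept = mean_d - rate * mean_t
    root_date = -intercept / rate  # date at which fitted distance equals zero
    return rate, root_date


# Hypothetical samples: collection year and distance from the root.
rate, root_date = root_to_tip_clock([2000.0, 2002.0, 2004.0], [0.01, 0.02, 0.03])
```

A regression like this is often used as a quick check for temporal signal before committing to a full Bayesian run, whose credible intervals on divergence dates are what an expert would actually report.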

Why do forensic phylogenetic analyses require community controls?

Community controls are pathogen sequences from individuals in the same geographic area with no epidemiological link to the case. They serve as the null hypothesis: if the patient's sequences cluster with community controls as tightly as with the accused's sequences, the clustering does not support direct transmission. Without controls, any two patients appear more related to each other than to a random reference simply because they are from the same region and time period. The Florida HIV case established community controls as a non-negotiable requirement; their exclusion from the dentist-patient clade was the evidential foundation. See [sequence alignment and BLAST](/topics/forensic-biotechnology/sequence-alignment-blast-genbank-mitomap-empop) for how sequences are prepared before phylogenetic analysis.

