How a forensic biotechnologist places an unknown sequence on a tree: neighbour-joining, maximum likelihood and Bayesian phylogenetic methods, bootstrap support, the cytochrome-b and COI barcoding standards, and the wildlife-forensic and microbial-attribution cases that turned phylogenetics into court-admissible evidence.
A phylogenetic tree is a hypothesis about evolutionary relationships expressed as a branching diagram. When a forensic scientist builds one from evidentiary sequences, the tree becomes something more specific: a geometrical argument that places an unknown sequence closer to one species, population, or individual than to any other. Courts on three continents have accepted this argument in diverse contexts. In the Florida HIV dental-transmission case (Ou et al., Science, 1992), phylogenetic analysis demonstrated that the viral sequences in five patients clustered with the sequences from their HIV-positive dentist rather than with local community controls, making transmission the most parsimonious explanation. In the 2007 Libyan HIV trial, phylogenetics was used to show that the strain infecting Benghazi children predated the arrival of the medical personnel accused of deliberate infection. In the Amerithrax investigation following the 2001 anthrax letter attacks, whole-genome phylogenetics of Bacillus anthracis Ames strain narrowed the source to a specific flask in a specific US Army laboratory.
These cases differ in scale and in the organisms involved, but they share a common computational structure: align the sequences, estimate a tree, measure how robustly each branch is supported, and report whether the unknown groups closer to the suspect source than to everything else. The tools that perform these operations, MEGA, MrBayes, RAxML, IQ-TREE, and BEAST, are standard in academic evolutionary biology but require forensic-specific calibration before their output can reach a courtroom.
This topic explains the three main phylogenetic methods, the statistics that measure their reliability, and the evidentiary standards three jurisdictions apply when a tree is offered as expert evidence.
The simplest phylogenetic method produces a tree in seconds, and the same simplicity that makes it useful for rapid screening is what makes it inadmissible as a standalone forensic conclusion.
Neighbour-joining (NJ), developed by Saitou and Nei in 1987, constructs a tree by iteratively pairing the two sequences (or operational taxonomic units, OTUs) that minimise the total branch length at each step. The algorithm operates on a pairwise distance matrix calculated from a multiple sequence alignment, where distances are typically corrected for multiple substitutions at the same site using models such as Kimura 2-parameter (K2P) or Jukes-Cantor (JC). K2P, which treats transitions (purine-to-purine or pyrimidine-to-pyrimidine substitutions) and transversions (purine-to-pyrimidine substitutions or the reverse) separately, is the BOLD Systems default for COI barcode species identification.
NJ is computationally inexpensive enough to run on a laptop for hundreds of sequences in under a minute. The BOLD Systems barcode engine uses NJ trees as a first-pass species identification method for its COI reference library, and many wildlife forensic laboratories use NJ as a preliminary screen before committing to a full maximum-likelihood analysis. In Germany, the Federal Criminal Police Office (BKA) laboratory uses NJ trees for preliminary species calls in ivory and timber casework, with confirmatory maximum-likelihood analysis performed before court reporting.
The forensic limitation of NJ is that distance-matrix methods lose the character-state information (the actual nucleotide at each position) that model-based methods use. NJ does not test competing topologies, it does not provide a statistical model that can be independently validated, and it does not quantify uncertainty in branch placement in the same probabilistic sense that Bayesian methods do. In most jurisdictions, NJ is acceptable as a screening tool but not as the sole phylogenetic evidence in a contested case.
Bootstrap analysis, applied to NJ trees, involves resampling the alignment columns with replacement to generate 100 to 1,000 replicate datasets, constructing an NJ tree from each, and recording the percentage of replicates in which a given clade appears. Bootstrap support above 70% is conventionally regarded as moderate support, above 95% as strong. The SWGWILD guidelines recommend reporting bootstrap values in published forensic phylogenetic analyses regardless of the tree-building method used.
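The column-resampling step can be sketched as follows. To keep the example short, the tree build for each replicate is replaced by a much simpler stand-in statistic (which reference has the smallest uncorrected p-distance to the query), so the percentage returned is not a true clade support value. All function names and the three-sequence alignment are assumptions for illustration.

```python
import random

def resample_columns(alignment, rng):
    """One bootstrap replicate: draw alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    # The same column indices are applied to every sequence.
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def p_distance(a, b):
    """Uncorrected proportion of differing sites."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def nearest_reference(alignment, query):
    """Stand-in for a tree build: the reference closest to the query."""
    refs = [name for name in alignment if name != query]
    return min(refs, key=lambda name: p_distance(alignment[query], alignment[name]))

def bootstrap_support(alignment, query, expected, replicates=1000, seed=1):
    """Percentage of replicates in which `expected` is the nearest reference."""
    rng = random.Random(seed)  # fixed seed so the run can be reproduced
    hits = sum(
        nearest_reference(resample_columns(alignment, rng), query) == expected
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# Hypothetical toy alignment.
aln = {"unknown": "ACGTACGT", "refA": "ACGTACGT", "refB": "TGCATGCA"}
support = bootstrap_support(aln, "unknown", "refA", replicates=200)
```

In casework the stand-in would be replaced by an actual NJ or ML build per replicate, with support tallied per clade.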
The two model-based phylogenetic methods frame their answers in the language of probability, and that probabilistic framing is exactly what forensic reporting under the likelihood-ratio framework requires.
Maximum likelihood (ML) phylogenetics, as implemented in RAxML-NG, IQ-TREE, and the PhyML engine, evaluates tree topology and branch lengths by finding the parameter values that maximise the probability of observing the data under a specified nucleotide substitution model. The substitution model (GTR+G+I is the most common for forensic mitochondrial data, where GTR is general time reversible, G accounts for among-site rate variation via a gamma distribution, and I allows for a proportion of invariable sites) is selected by model-testing procedures implemented in tools such as ModelTest-NG or IQ-TREE's built-in ModelFinder. Selecting the best-fit model and documenting the selection procedure is a reproducibility requirement: the 2016 updated ENFSI best practice guidelines for forensic molecular genetic analysis recommend documenting the substitution model and model-selection procedure in the case file.
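The core idea of ML, choosing the parameter value that maximises the probability of the data under a substitution model, can be shown on the smallest possible problem: one branch between two aligned sequences under the Jukes-Cantor model. This is a sketch only. Real engines optimise topology and many parameters jointly under richer models such as GTR+G+I, and the function names and golden-section search here are illustrative choices rather than anything taken from RAxML-NG or IQ-TREE.

```python
import math

def jc69_log_likelihood(branch_length, n_same, n_diff):
    """Log-likelihood of a branch length (substitutions/site) under JC69."""
    decay = math.exp(-4.0 * branch_length / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)  # joint prob. of one identical site pattern
    p_diff = 0.25 * (0.25 - 0.25 * decay)  # joint prob. of one differing site pattern
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

def ml_branch_length(n_same, n_diff, lo=1e-6, hi=5.0, tol=1e-9):
    """Maximise the log-likelihood by golden-section search on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        left = hi - inv_phi * (hi - lo)
        right = lo + inv_phi * (hi - lo)
        if jc69_log_likelihood(left, n_same, n_diff) > jc69_log_likelihood(right, n_same, n_diff):
            hi = right   # maximum lies in [lo, right]
        else:
            lo = left    # maximum lies in [left, hi]
    return (lo + hi) / 2.0

def jc69_closed_form(n_same, n_diff):
    """Analytical JC69 distance, used here only as a cross-check."""
    p = n_diff / (n_same + n_diff)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# 100 aligned sites, 10 of which differ (illustrative counts).
estimate = ml_branch_length(90, 10)
```

The numerical optimum agrees with the closed-form JC69 distance, which is a useful sanity check before trusting an optimiser on models that have no closed form.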
Bayesian phylogenetics, as implemented in MrBayes and BEAST, treats tree topology and model parameters as random variables and estimates their joint posterior probability distribution using Markov chain Monte Carlo (MCMC) sampling. The output is a posterior probability (PP) for each node: the proportion of post-burn-in MCMC samples in which that node appeared. A PP of 0.95 means that the clade was present in 95% of sampled trees, conditional on the data, the substitution model, and the priors; it is not equivalent to a frequentist 95% confidence level and is not directly interchangeable with a bootstrap percentage. This probabilistic output maps naturally onto the forensic likelihood-ratio framework favoured by ENFSI, ISFG, and SWGDAM.
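The way a node's posterior probability is read off the MCMC output can be sketched as a simple tally. The representation below, with each sampled tree reduced to a set of clades and each clade a frozenset of taxon names, is an assumption for the example, as are the function name and the 25% burn-in default; it does not parse real MrBayes or BEAST output files.

```python
from collections import Counter

def clade_posteriors(sampled_trees, burnin_fraction=0.25):
    """Proportion of post-burn-in samples in which each clade appears.

    `sampled_trees` is a list of trees in sampling order; each tree is
    an iterable of clades, and each clade is a frozenset of taxon names.
    """
    start = int(len(sampled_trees) * burnin_fraction)
    kept = sampled_trees[start:]  # discard the burn-in samples
    if not kept:
        raise ValueError("no samples left after burn-in")
    counts = Counter()
    for tree in kept:
        counts.update(set(tree))  # count each clade at most once per tree
    return {clade: n / len(kept) for clade, n in counts.items()}


# Hypothetical samples: does the unknown group with source A or source B?
with_a = {frozenset({"unknown", "sourceA"})}
with_b = {frozenset({"unknown", "sourceB"})}
samples = [with_b, with_b, with_a, with_a, with_a, with_b, with_a, with_a]
pp = clade_posteriors(samples, burnin_fraction=0.25)
```

Here the first two samples are discarded as burn-in, and the unknown groups with source A in five of the remaining six.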
The Florida HIV case (analysis published in 1992) was one of the first investigations to use phylogenetic evidence forensically. It was a CDC-led epidemiological investigation rather than a criminal prosecution, and the phylogenetic analysis of HIV-1 env gene sequences predated Bayesian tree-building software. The analysis showed that the five patients' viral sequences were nested within the dentist's viral sequences in every well-supported tree. It was replicated by independent groups, and the replication confirmed the branching structure. The case established several forensic phylogenetics principles that the NIAID and CDC subsequently formalised: sequences from community controls must be included; the analysis must be performed by independent analysts; the tree must be displayed with branch lengths; and the bootstrap or posterior probability values must be reported.
In the 2007 Libyan HIV trial, a BEAST-based molecular-clock analysis conducted by Luc Montagnier and Vittorio Colizzi showed that the Benghazi children's viral strains had been circulating in Libya since the mid-1990s, predating the arrival of the accused medical personnel by years. The phylogenetic evidence was central to the eventual release of the accused and formed the basis of the European Parliament's report on the trial.
Two mitochondrial genes have been provisionally declared the universal forensic species-identification markers, but the declaration comes with caveats that every wildlife case file should address.
The cytochrome c oxidase subunit I (COI) gene, a 648-bp region amplifiable with the universal LCO1490/HCO2198 primer pair, was proposed by Paul Hebert and colleagues (University of Guelph, Canada) in 2003 as the universal DNA barcode for animal species identification. The Barcode of Life Data Systems (BOLD), also maintained at Guelph, holds the reference library for COI-based identification across hundreds of thousands of species. BOLD's identification engine uses NJ trees against its curated library and reports a percent similarity to the nearest named species. For vertebrates, a COI sequence above 97-98% similarity to a single species, with at least 2% divergence from the next-closest species, is the conventional forensic threshold for species identification.
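The conventional threshold can be written as a small decision rule. This is a sketch under stated assumptions: the function name, the 98% default (the text gives a 97-98% range), and the reading of "at least 2% divergence from the next-closest species" as "the query is no more than 98% similar to the runner-up species" are all choices made for illustration, and the similarity values are invented.

```python
def call_species(best_hits, min_similarity=98.0, min_divergence_from_next=2.0):
    """Return a species name if the barcode thresholds are met, else None.

    `best_hits` maps species name -> best percent similarity between the
    query and any reference sequence of that species.
    """
    ranked = sorted(best_hits.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    top_species, top_similarity = ranked[0]
    if top_similarity < min_similarity:
        return None  # no reference species is close enough
    if len(ranked) > 1:
        next_similarity = ranked[1][1]
        if 100.0 - next_similarity < min_divergence_from_next:
            return None  # runner-up species is too close to exclude
    return top_species


# Invented similarity values for illustration only.
clear_call = call_species({"Panthera tigris": 99.5, "Panthera leo": 94.0})
ambiguous = call_species({"species A": 99.0, "species B": 98.6})
too_distant = call_species({"species A": 96.0})
```

Returning `None` rather than the top hit makes the inconclusive outcome explicit, which is the result a case file should record when the thresholds are not met.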
Cytochrome-b (cyt-b), a 359 to 1140-bp region depending on the primer pair, is the traditional forensic vertebrate identification marker that preceded the formal barcoding movement. It has a larger body of forensic validation literature, particularly for wildlife casework. The USFWS National Fish and Wildlife Forensics Laboratory (NFWFL) in Ashland, Oregon, the world's only accredited laboratory dedicated entirely to wildlife forensics, uses cyt-b as its primary species identification marker for mammalian casework, backed by an internal validated reference library. In the European Union, the TRACES network, coordinated through national wildlife forensics laboratories, uses cyt-b alongside COI for CITES species identification.
In India, the Wildlife Institute of India's Forensic Laboratory (WII, Dehradun) uses cyt-b and COI for tiger, leopard, elephant, and rhinoceros casework under the Wild Life (Protection) Act, 1972. Published studies from WII have demonstrated differentiation of all five Indian big cats (lion, tiger, leopard, snow leopard, clouded leopard) using cyt-b sequences, with greater than 98% pairwise identity within species and greater than 5% divergence between species. The same methodology has been used in ivory-trafficking prosecutions involving CITES Appendix I species.
Pathogen phylogenetics entered the courtroom before most forensic biologists had heard of the technique, and the standards it established for replication and independent expert review became the template for every case that followed.
Microbial forensics is the application of molecular phylogenetics to attribute a pathogen to its source, whether that source is an individual (in transmission cases), a facility (in biosecurity attribution), or a geographic reservoir (in outbreak investigation). The discipline draws on the same tree-building tools as wildlife forensics but applies them in a context where the potential for deliberate tampering with sequences and the life-or-death stakes for the accused make independent replication non-negotiable.
The Amerithrax investigation (FBI, 2001-2008) is the paradigm case. After the anthrax letter attacks in autumn 2001 killed five people and infected 17 others across the United States, investigators used whole-genome comparative phylogenetics of Bacillus anthracis Ames strain. Four morphotype variants in the spore preparation were found to match variants in a single flask (RMR-1029) maintained at Fort Detrick, Maryland, US. The phylogenetic analysis used SNP-based maximum-parsimony trees and was subjected to independent review by the National Biodefense Analysis and Countermeasures Center (NBACC). The attribution pointed to one individual, who died before trial. The methodology, published in PNAS in 2011 by Rasko, Keim, and colleagues, became a template for microbial forensic standards published by the American Academy of Microbiology.
SARS-CoV-2 phylogenetics during the COVID-19 pandemic (2020 onwards) demonstrated the same framework at global scale. The GISAID (Global Initiative on Sharing All Influenza Data) database, hosted in Munich with mirror nodes worldwide, provided a real-time repository of SARS-CoV-2 genome sequences from laboratories in more than 180 countries. Nextstrain, an open-source phylodynamic toolkit (built on maximum-likelihood tools such as IQ-TREE and TreeTime rather than on BEAST), visualised variant emergence and transmission chains, enabling public health genomic surveillance at a speed never previously achieved. The SARS-CoV-2 case established that global pathogen phylogenetics is now an operational capability, not just a research tool, with direct implications for biosecurity attribution in future outbreak investigations.
In the UK, the Health Security Agency (UKHSA) runs a National Influenza Pandemic Preparedness Plan that incorporates phylodynamic surveillance. In India, INSACOG (Indian SARS-CoV-2 Genomics Consortium), coordinated through the Biological and Health Sciences division of the Department of Biotechnology, sequenced over 100,000 SARS-CoV-2 genomes between 2021 and 2022, demonstrating the institutional capacity for large-scale pathogen phylogenetics in the Indian system.
A phylogenetic tree is only as reproducible as the parameter choices that built it, and reproducibility is the first thing an opposing expert examines when forensic phylogenetics is challenged in court.
Five software packages dominate forensic phylogenetic analysis. MEGA (Molecular Evolutionary Genetics Analysis, developed at Arizona State University, US, and Temple University, US) is the most accessible entry point, providing NJ and ML tree construction with a graphical interface. It is the most cited phylogenetics package in the scientific literature and is appropriate for preliminary analyses, but it is not designed for the MCMC sampling that Bayesian analysis requires. MrBayes (developed at the University of Rochester, US, and the Swedish Museum of Natural History) is the standard Bayesian phylogenetics engine for forensic applications requiring posterior probability estimates. RAxML-NG (Heidelberg, Germany) and IQ-TREE (developed with contributions from Austrian and Australian groups) are the standard ML engines for large datasets. BEAST (Bayesian Evolutionary Analysis Sampling Trees, developed at the University of Auckland, New Zealand, and the University of Edinburgh, UK) performs Bayesian analysis with a molecular clock, enabling estimation of divergence times, which is forensically relevant in transmission timing analysis.
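The intuition behind molecular-clock dating can be shown with a far simpler technique than BEAST's Bayesian sampling: an ordinary least-squares regression of root-to-tip genetic distance on sampling date. Under a strict clock the slope estimates the substitution rate, and the date at which the fitted line crosses zero distance estimates the age of the root. The function name and the three data points are hypothetical, and this sketch reports no uncertainty, which a forensic dating analysis would have to provide.

```python
def root_to_tip_regression(sampling_dates, root_to_tip_distances):
    """Least-squares clock fit; returns (rate per year, estimated root date)."""
    n = len(sampling_dates)
    if n < 2 or n != len(root_to_tip_distances):
        raise ValueError("need at least two matched date/distance pairs")
    mean_x = sum(sampling_dates) / n
    mean_y = sum(root_to_tip_distances) / n
    sxx = sum((x - mean_x) ** 2 for x in sampling_dates)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(sampling_dates, root_to_tip_distances)
    )
    if sxx == 0 or sxy == 0:
        raise ValueError("no temporal signal: cannot estimate a rate")
    rate = sxy / sxx                   # substitutions per site per year
    intercept = mean_y - rate * mean_x
    root_date = -intercept / rate      # date at which distance is zero
    return rate, root_date


# Hypothetical tips sampled in 2000, 2002 and 2004.
rate, root_date = root_to_tip_regression(
    [2000.0, 2002.0, 2004.0], [0.010, 0.012, 0.014]
)
```

With these invented values the fit gives a rate of 0.001 substitutions per site per year and a root date of 1990. A transmission-timing argument has the same form: if the estimated root predates a suspect's access, the suspect cannot be the origin.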
Reproducibility in forensic phylogenetics requires documenting: (1) the alignment software and parameters used; (2) the substitution model and model-selection procedure; (3) the outgroup sequence (the root anchor that determines the direction of the tree); (4) the bootstrap replication count or MCMC chain length; (5) the software version and random seed used; and (6) the GenBank accession numbers of all reference sequences included. ENFSI's best-practice guidelines for forensic molecular genetics (2016, updated 2021) require all six for any phylogenetic analysis submitted in expert evidence. The FBI's Forensic Science Communications and the American Academy of Microbiology's 2012 report on microbial forensics carry equivalent requirements.
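One way to make the six items hard to forget is to record them in a structured object that can be checked and serialised into the case file. The sketch below is an assumption about how a laboratory might do this, not a format prescribed by ENFSI or any other body; the class name, field names, and every example value (including the placeholder accessions) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class PhyloRunRecord:
    """Hypothetical case-file record covering the six documentation items."""
    alignment_software: str            # (1) alignment tool ...
    alignment_parameters: str          # ... and its parameters
    substitution_model: str            # (2) model ...
    model_selection_procedure: str     # ... and how it was chosen
    outgroup: str                      # (3) root anchor
    support_method: str                # (4) "bootstrap" or "mcmc"
    replicates_or_chain_length: int    # (4) replicate count or chain length
    software_version: str              # (5) tree-building software version
    random_seed: int                   # (5) seed (0 is a valid seed)
    reference_accessions: List[str]    # (6) accession numbers

    def missing_fields(self):
        """Names of fields left empty; an empty list means complete."""
        return [
            name
            for name, value in asdict(self).items()
            if value is None or value == "" or value == []
        ]


record = PhyloRunRecord(
    alignment_software="example-aligner 1.0",
    alignment_parameters="default settings",
    substitution_model="GTR+G+I",
    model_selection_procedure="best-fit model by information criterion",
    outgroup="OUTGROUP_PLACEHOLDER",
    support_method="bootstrap",
    replicates_or_chain_length=1000,
    software_version="example-ml-engine 2.3",
    random_seed=0,
    reference_accessions=["ACCESSION_PLACEHOLDER_1", "ACCESSION_PLACEHOLDER_2"],
)
case_file_json = json.dumps(asdict(record), indent=2)
```

Calling `record.missing_fields()` before the report is signed gives a quick completeness check, and the JSON string can be stored alongside the alignment and tree files.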
| Software | Method | Output metric | Typical forensic use |
|---|---|---|---|
| MEGA | NJ, ML | Bootstrap % | Preliminary species ID, educational use |
| RAxML-NG | Maximum likelihood | Bootstrap % | Large datasets, rapid court timelines |
| IQ-TREE | ML + ModelFinder | Bootstrap % (UFBoot) | ML trees with model selection built in |
| MrBayes | Bayesian MCMC | Posterior probability | Human transmission cases, microbial attribution |
| BEAST | Bayesian MCMC + clock | PP + divergence dates | Transmission timing, outbreak dating |
In the 1992 Florida HIV dental-transmission case, phylogenetic evidence placed the patients' viral sequences within a clade defined by the dentist's viral sequences. Which phylogenetic principle made this placement forensically significant rather than coincidental?