Sequence Alignment, BLAST, GenBank, Mitomap and EMPOP

Q: What does an E-value of 1e-50 mean in a BLAST result, and why must it be reported alongside percent identity?

The E-value is the number of alignments at least as good as the observed one that would be expected by chance alone when searching a database of the given size. An E-value of 1e-50 means that about 10^-50 such chance hits are expected per search, or one in roughly 10^50 comparable searches. It differs from percent identity because it also depends on alignment length and database size: a 98% identity match across 100 bp may have a modest E-value because the alignment is short, whereas the same 98% identity across 1,000 bp produces a far lower E-value. SWGWILD and ENFSI Wildlife Working Group guidelines require both metrics because reporting only percent identity makes short, high-identity chance matches appear more significant than they are. A forensic report should state both values and note whether the top hit is unambiguous or whether multiple species cluster at similar scores.
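The length dependence described above can be sketched with the Karlin-Altschul relation E = m × n × 2^(−S′), where m is the query length, n the database length in bases, and S′ the bit score. This is a minimal illustration, not BLAST's exact calculation (BLAST uses edge-corrected effective lengths); the database size and the two bit scores below are hypothetical values chosen only to contrast a short and a long alignment. The function works in log10 because very small E-values underflow ordinary floating-point numbers.

```python
import math


def log10_evalue(bit_score: float, query_len: int, db_len: int) -> float:
    """Return log10(E) for E = m * n * 2**(-S'), the Karlin-Altschul expectation.

    Simplification: raw lengths are used instead of BLAST's effective lengths.
    """
    return math.log10(query_len) + math.log10(db_len) - bit_score * math.log10(2)


DB_LEN = 10**12  # hypothetical database size in bases

# Hypothetical bit scores for a short and a long high-identity alignment.
short_hit = log10_evalue(bit_score=170.0, query_len=100, db_len=DB_LEN)
long_hit = log10_evalue(bit_score=1700.0, query_len=1000, db_len=DB_LEN)

print(f"100 bp alignment:  log10(E) = {short_hit:.1f}")
print(f"1000 bp alignment: log10(E) = {long_hit:.1f}")
```

Because the bit score grows roughly in proportion to alignment length while the search-space term grows only logarithmically, the longer alignment ends up with a far smaller E-value at the same percent identity.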

Q: Why can a BLAST search of GenBank produce a misleading forensic result even when the query and search are technically correct?

GenBank accepts sequences from any laboratory and does not independently verify taxonomic annotations before inclusion. A submitted sequence can carry an incorrect species name due to mislabelled reference material, lab contamination, or post-deposition taxonomic reclassification. A BLAST query against such an entry returns a confident match to the wrong species. This is why wildlife forensics workflows pair a GenBank BLAST with a query against a curated database such as BOLD Systems for COI barcodes, or EMPOP's SAM2-verified dataset for human mtDNA. For an applied example, see [species identification by cytochrome-b, COI and 16S rRNA](/topics/forensic-biotechnology/species-identification-by-cytochrome-b-coi-and-16s-rrna).

Q: What is the INSDC, and why does it matter that GenBank, ENA, and DDBJ are part of it?

The International Nucleotide Sequence Database Collaboration (INSDC) is the tri-node partnership between GenBank (NCBI, US), the European Nucleotide Archive (EMBL-EBI, UK/EU), and the DNA Data Bank of Japan (DDBJ). All three nodes synchronise submissions daily, so a sequence deposited at any one node appears in all three with an identical accession number. For forensic practice, a BLAST search against NCBI's nucleotide collection (nt) therefore covers the same public sequence data as ENA or DDBJ. EMBL accession numbers cited in European court reports are retrievable from NCBI GenBank and vice versa.

Q: How does EMPOP's SAM2 pipeline improve mtDNA frequency estimate reliability compared to a direct GenBank query?

SAM2 performs phylogenetic placement against the PhyloTree human mtDNA phylogeny for every submitted sequence, identifies positions inconsistent with the assigned haplogroup, and flags chimeras, mislabelled sequences, and nuclear mitochondrial DNA inserts (NUMTs). Sequences failing the screen are excluded or held pending correction. EMPOP's frequency estimates are therefore calculated on a phylogenetically verified dataset, unlike a GenBank query where sequences of uneven quality contribute equally. For court reports under R v. Doheny principles (UK) or BSA 2023 (India), the provenance and quality of the frequency database must be disclosed; EMPOP's documented QC pipeline satisfies this requirement. The [NGS data analysis](/topics/forensic-biotechnology/ngs-data-analysis-allele-calling-and-variant-callers) topic covers the upstream pipeline that generates sequences submitted to EMPOP.

The reference databases and alignment tools every NGS-era forensic analyst uses: pairwise and multiple sequence alignment (BLAST, Clustal Omega, MAFFT), GenBank for species ID, Mitomap and EMPOP for human mtDNA, and the YHRD population database for Y-STR haplotypes, with the quality-control rules each repository enforces.

Last updated: 5 Jun 2026

When a forensic sequencing run produces a string of A, T, G, and C characters, those characters mean nothing until they are placed against a reference. Sequence alignment is the computational act of lining two or more sequences alongside each other, column by column, to measure identity and locate differences. This is the gateway step between raw laboratory output and a reportable forensic interpretation. A wildlife examiner in New Zealand comparing a seized tissue sample against GenBank's vertebrate records, a forensic odontologist in Germany querying EMPOP for a mitochondrial DNA haplotype match, and a detective in the United States running BLAST on a bacterial sequence from a bioterror letter are all performing the same logical operation: placing an unknown sequence within a landscape of known ones.

Key takeaways

BLAST uses heuristic local alignment (Smith-Waterman logic with 11-mer seeds) to search GenBank in seconds; forensic reporting requires both the E-value and percent identity from the top five hits, not the top hit name alone.
GenBank does not independently verify submitted taxonomic annotations; cross-validating a BLAST hit against a curated database such as BOLD Systems (for COI barcodes) or EMPOP (for human mtDNA) is mandatory in accredited wildlife and mtDNA casework.
EMPOP's SAM2 pipeline screens every submitted mtDNA sequence for phylogenetic inconsistencies before it enters the frequency dataset, making it the required frequency source for court-reportable mtDNA haplotype estimates under ISFG, FBI, and UK Forensic Science Regulator guidelines.
Mitomap (Emory University) curates pathogenic and polymorphic human mtDNA variants against the rCRS reference (GenBank NC_012920); EMPOP (Innsbruck) provides population-frequency tables formatted for forensic likelihood-ratio calculations.
YHRD (Humboldt University Berlin) holds over 300,000 male haplotypes from more than 1,400 population samples; for zero-observation haplotypes the ISFG-endorsed minimal estimator assigns a frequency of 1/N, not zero.

The reference databases that anchor this process are maintained by a network of national and international institutions: the National Center for Biotechnology Information (NCBI) at the US National Library of Medicine maintains GenBank; the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) at Hinxton, near Cambridge, maintains the European Nucleotide Archive (ENA); the DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics maintains the third node of the International Nucleotide Sequence Database Collaboration (INSDC), which synchronises all three databases daily. Forensic-specific overlays built on these foundations include the European DNA Profiling Group's (EDNAP) mtDNA database, now hosted as EMPOP at Innsbruck Medical University, Austria; the Mitomap human mitochondrial genome database at Emory University, US; and the Y-Chromosome Haplotype Reference Database (YHRD) coordinated from Humboldt University Berlin.

Each repository imposes its own submission standards, nomenclature rules, and quality filters. A forensic examiner who does not understand those standards may query a database correctly and still draw the wrong conclusion, because the sequence being matched against was itself submitted with a notation that flags it as provisional or from an uncommon haplogroup subtype. This topic maps the alignment algorithms, the databases, and the QC rules that govern each one.

Pairwise Alignment: BLAST and the Smith-Waterman Logic

The fastest way to find what an unknown sequence is related to involves a deliberate sacrifice of sensitivity for speed, and knowing exactly what that sacrifice costs is what separates a reliable forensic query from a misleading one.

Sequence alignment begins with two sequences and asks: what arrangement of matches, mismatches, and gaps makes these sequences most similar? Pairwise alignment follows two main algorithmic approaches. Global alignment (the Needleman-Wunsch algorithm, 1970) aligns the full length of both sequences end-to-end, penalising gaps and mismatches but guaranteeing that every position is accounted for. Local alignment (the Smith-Waterman algorithm, 1981) finds the highest-scoring region of similarity between two sequences, with gaps permitted inside that region, and ignores everything outside it. In forensic species identification, where an unknown sequence fragment may correspond to only part of a reference gene, local alignment is nearly always the appropriate choice.
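The local-alignment idea can be sketched as a short dynamic-programming routine. This is a minimal Smith-Waterman scorer with a linear gap penalty; the scoring values (match +2, mismatch −3, gap −5) are illustrative assumptions, not the parameters of any particular forensic workflow, and the sequences are made up.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -3, gap: int = -5) -> int:
    """Return the best local alignment score between a and b.

    Each cell holds the best score of an alignment ending at that pair of
    positions; the floor of 0 is what makes the alignment local, because a
    poor-scoring prefix can always be dropped.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up sequences that share only the central region ACGTACGT.
score = smith_waterman_score("TTTACGTACGTAAA", "GGGACGTACGTCCC")
print(score)
```

The unrelated flanks contribute nothing to the score, which is the behaviour wanted when a query fragment covers only part of a reference gene. A global (Needleman-Wunsch) scorer would differ mainly in dropping the floor of 0 and charging gap penalties along the matrix edges.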

BLAST (Basic Local Alignment Search Tool), developed at NCBI and first published in 1990, is the most widely used implementation of heuristic local alignment. Rather than computing the full Smith-Waterman dynamic programming matrix for every sequence in a database, BLAST first identifies short exact matches (seeds, typically 11 nucleotides for the nucleotide variant blastn), then extends those seeds only if they score above a threshold. This sacrifices sensitivity against very divergent sequences but avoids filling a matrix whose size is the product of the two sequence lengths for every database entry, making GenBank-scale searches feasible in seconds. For each hit the output includes the percent identity, a bit score, and an E-value (the expected number of hits at least as good arising by chance alone). A forensic examiner interpreting a BLAST result should report both the percent identity and the E-value, not just the top-hit name.
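The seed-and-extend idea can be illustrated with a toy sketch. This is not BLAST's implementation: it indexes exact k-mers of a single subject sequence and performs a simple ungapped extension to the right that stops once the running score falls too far below its best value. The function names, the scoring values, and the `x_drop` cutoff are all assumptions made for illustration, and the sequences are invented.

```python
from collections import defaultdict


def find_seeds(query: str, subject: str, k: int = 11) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where an exact k-mer is shared."""
    index = defaultdict(list)
    for s in range(len(subject) - k + 1):
        index[subject[s:s + k]].append(s)
    return [(q, s)
            for q in range(len(query) - k + 1)
            for s in index.get(query[q:q + k], [])]


def extend_right(query: str, subject: str, q_end: int, s_end: int,
                 match: int = 1, mismatch: int = -2, x_drop: int = 5) -> int:
    """Ungapped extension to the right of a seed.

    Returns how many columns past the seed give the best running score;
    extension stops when the score drops more than x_drop below that best.
    """
    score = best = best_len = 0
    i = 0
    while q_end + i < len(query) and s_end + i < len(subject):
        score += match if query[q_end + i] == subject[s_end + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break
    return best_len


query = "ATGGCCATTGTAATGGGCCGC"          # invented 21-base query
subject = "TTT" + query + "AAA"          # the query embedded in a longer subject
seeds = find_seeds(query, subject)
first_q, first_s = seeds[0]
extra = extend_right(query, subject, first_q + 11, first_s + 11)
print(len(seeds), (first_q, first_s), extra)
```

Only database sequences that share at least one seed are ever extended, which is where the speed comes from, and also why a very divergent sequence with no exact 11-base match can be missed entirely.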

In the United States, NIST's Organization of Scientific Area Committees for Forensic Science (OSAC) has not yet published a specific standard for BLAST query interpretation, but the Scientific Working Group for Wildlife Forensics (SWGWILD) guidelines recommend reporting the top five BLAST hits with percent identity and E-value, and noting whether the top hit is unambiguous or whether multiple species share near-identical top scores. In the European Union, ENFSI's Wildlife Working Group guidelines for species identification carry a similar requirement. In India, wildlife forensics conducted under the Wildlife Institute of India's laboratory framework increasingly follows the SWGWILD model; the broader evidential-weight framework connecting sequence identity to court-admissible probability is addressed in the topic on random match probability and the likelihood ratio.
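The reporting rule described above (top five hits, both metrics, and an explicit ambiguity check) can be sketched as a small helper. The `Hit` class, the function name, the 1.0 percentage-point margin used to flag a rival species, and the example hits are all hypothetical; none of them comes from SWGWILD or ENFSI text, and a laboratory would substitute its own validated criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    species: str
    percent_identity: float
    evalue: float


def summarise_hits(hits: list[Hit], top_n: int = 5, min_gap: float = 1.0) -> dict:
    """Rank hits by E-value, keep the top_n, and flag near-identical rival species.

    A rival is any other species among the reported hits whose percent
    identity is within min_gap points of the top hit (hypothetical rule).
    """
    if not hits:
        raise ValueError("no hits to summarise")
    ranked = sorted(hits, key=lambda h: (h.evalue, -h.percent_identity))[:top_n]
    top = ranked[0]
    rivals = [h.species for h in ranked[1:]
              if h.species != top.species
              and top.percent_identity - h.percent_identity < min_gap]
    return {
        "top_species": top.species,
        "unambiguous": not rivals,
        "rival_species": rivals,
        "reported_hits": ranked,
    }


# Invented example hits, for illustration only.
hits = [Hit("Species A", 99.5, 1e-150), Hit("Species B", 97.1, 1e-140)]
report = summarise_hits(hits)
print(report["top_species"], report["unambiguous"])
```

Keeping the full ranked list in the output, rather than only the top name, mirrors the requirement that the report show how close the competing candidates were.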

BLAST pairwise alignment pipeline: query sequence generates seeds, seeds are extended against the GenBank database, and hits are ranked by E-value and percent identity before a species call is made. A minimum of three independent top hits with E-value below 1e-50 is a reasonable forensic threshold.

Multiple Sequence Alignment: Clustal Omega, MAFFT and MUSCLE

Comparing one sequence against a reference is the beginning; placing it within a family of related sequences to determine its exact position in a population is the interpretive step that gives a forensic result its statistical weight.

Multiple sequence alignment (MSA) arranges three or more sequences simultaneously so that homologous positions in different sequences are stacked in the same column. MSA is the prerequisite for phylogenetic analysis, population-frequency estimation, and haplogroup assignment. The principal MSA tools a forensic bioinformatician uses are Clustal Omega (updated successor to the original ClustalW), MAFFT (Multiple Alignment using Fast Fourier Transform), MUSCLE (Multiple Sequence Comparison by Log-Expectation), and T-Coffee (which improves accuracy by combining pairwise alignments before producing the multiple alignment).

For mitochondrial DNA forensics, MAFFT and Clustal Omega are the most widely cited in peer-reviewed validation studies. MAFFT's FFT-NS-2 progressive mode is fast enough to handle the mtDNA population datasets used for EMPOP queries (tens of thousands of sequences), while its iterative-refinement L-INS-i mode provides higher accuracy for smaller, more divergent sets. The ENFSI DNA Working Group's 2019 guidelines on mtDNA analysis list MAFFT explicitly as an acceptable alignment tool for European forensic labs. The US FBI Laboratory's mtDNA guidelines similarly require a documented MSA step before haplogroup assignment and population frequency estimation.
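Choosing between the two MAFFT modes can be made explicit, and therefore documentable in a case file, by building the command line in code. The sketch below only constructs the argument list; the option sets are the ones the MAFFT manual gives for these two strategies as far as I am aware, and should be checked against the installed version. The helper name and file name are hypothetical.

```python
# Option sets for the two MAFFT strategies discussed above (verify against
# the documentation of the installed MAFFT version before relying on them).
MAFFT_MODES = {
    "fft-ns-2": ["--retree", "2", "--maxiterate", "0"],   # fast, progressive
    "l-ins-i": ["--localpair", "--maxiterate", "1000"],   # slower, iterative refinement
}


def mafft_command(mode: str, fasta_path: str) -> list[str]:
    """Build the MAFFT argument list for a named mode and an input FASTA file."""
    if mode not in MAFFT_MODES:
        raise ValueError(f"unknown MAFFT mode: {mode!r}")
    return ["mafft", *MAFFT_MODES[mode], fasta_path]


cmd = mafft_command("l-ins-i", "population_hv1.fasta")  # hypothetical file name
print(" ".join(cmd))
```

MAFFT writes the alignment to standard output, so a caller would typically pass the list to `subprocess.run(cmd, capture_output=True, text=True, check=True)` and save the captured output alongside the exact command used.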

The output of an MSA is a gap-containing alignment file, typically in FASTA or PHYLIP format, where inserted hyphens represent gaps in a particular sequence relative to the consensus. For forensic reporting, the alignment must be assessed for: (1) correct handling of hypervariable regions in the mitochondrial control region (HV1 and HV2), where insertion and deletion polymorphisms create genuine ambiguity about alignment position; (2) confirmation that the reference sequence used (the revised Cambridge Reference Sequence, rCRS, accession NC_012920) is correctly incorporated; and (3) notation of any heteroplasmic positions in the query sequence.
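Reading differences off a gapped alignment can be sketched as follows. The function walks a reference row and a query row of equal length and reports substitutions, insertions, and deletions numbered against the reference, using the familiar mtDNA style (for example `G3A`, `4.1C`, `7DEL`). It is a minimal sketch: the sequences are invented toy strings rather than rCRS, and the function simply trusts whatever alignment it is given, so it does not apply the phylogenetic placement conventions needed in length-variable regions of HV1 and HV2.

```python
def list_differences(aligned_ref: str, aligned_query: str, start: int = 1) -> list[str]:
    """List query differences relative to the reference from two aligned rows.

    Hyphens mark gaps. `start` is the reference position of the first
    reference base in the alignment.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned rows must have equal length")
    diffs = []
    ref_pos = start - 1   # position of the last reference base consumed
    ins_count = 0         # bases inserted after the current reference position
    for r, q in zip(aligned_ref.upper(), aligned_query.upper()):
        if r == "-":
            if q == "-":
                continue  # column is a gap in both rows
            ins_count += 1
            diffs.append(f"{ref_pos}.{ins_count}{q}")
        else:
            ref_pos += 1
            ins_count = 0
            if q == "-":
                diffs.append(f"{ref_pos}DEL")
            elif q != r:
                diffs.append(f"{r}{ref_pos}{q}")
    return diffs


# Toy rows, not real mtDNA.
print(list_differences("ACGT-ACGT", "ACATCAC-T"))
```

A heteroplasmic position encoded with an IUPAC ambiguity code in the query (for example `Y` for C/T) would appear in the output as a substitution to that code, which is one simple way to make check (3) above visible in the case notes.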

Tool	Algorithm	Speed	Accuracy	Best forensic use
Clustal Omega	HMM-based progressive	Fast (parallelised)	Good for closely related seqs	mtDNA population alignment for EMPOP
MAFFT (L-INS-i)	Iterative refinement	Moderate	High for < 200 seqs	Final alignment before phylogenetics
MAFFT (FFT-NS-2)	FFT + progressive	Very fast	Moderate	Large population datasets
MUSCLE	Progressive + refinement	Fast	High	STR flanking-region alignment
T-Coffee	Combined pairwise + MSA	Slow	Highest	Cross-validation of critical regions

GenBank: The World's Sequence Repository and Its QC Rules

A database is only as trustworthy as what it lets in, and GenBank's submission pipeline is both the most accessible public repository in forensic bioinformatics and the most frequently misunderstood.

GenBank, maintained by NCBI/NLM in Bethesda, Maryland, is the primary public nucleotide sequence repository and the default BLAST search target. As of 2024, GenBank holds over 10 trillion nucleotide bases across hundreds of millions of records spanning all described kingdoms of life. The database is part of the INSDC, meaning every submission to GenBank is exchanged daily with ENA (EMBL-EBI, Hinxton, UK) and DDBJ (National Institute of Genetics, Mishima, Japan). A sequence submitted to any one of the three nodes receives an accession number that is valid in all three.

GenBank does not perform independent biological verification of submitted sequences. Submissions pass through automated checks (format validation, vector contamination screening via VecScreen, and basic annotation review) but the scientific accuracy of the taxonomic identification is the submitter's responsibility. This is the single most important QC limitation for forensic use: a sequence deposited with a wrong species name, even in good faith, will appear as a legitimate match in a BLAST search. The forensic response to this limitation is to cross-validate species identifications using BOLD Systems (for COI barcodes), where submissions are validated against reference libraries curated by taxonomy experts.

When submitting a forensic sequence to GenBank (for example, when contributing evidentiary sequence data to a publication or a shared database), submitters must provide: the organism name at the species level, the isolation source, the collection date, the country of collection, and the sequence coordinates corresponding to the annotated gene. NCBI's BioSample framework requires these metadata fields for all genome-scale submissions. For forensic publications requiring GenBank accession numbers as a condition of journal submission, these metadata requirements must be completed before submission review begins, a step that has delayed several forensic wildlife-trafficking publications.

Mitomap and EMPOP: Human mtDNA Databases with Forensic QC

Two databases curate the same mitochondrial genome but with different primary audiences, and choosing the right one for a forensic mtDNA query is the difference between a population-frequency estimate a court will accept and one it will not.

Human mitochondrial DNA forensics requires two distinct computational steps: haplogroup assignment (placing the sample within the phylogenetic tree of human mtDNA diversity) and population frequency estimation (asking how common the observed haplotype is in the relevant reference population). Two databases dominate:

Mitomap (hosted at Emory University, Atlanta, Georgia, US) is the primary repository for pathogenic and polymorphic mtDNA variants. It curates published mtDNA sequences, disease associations, and population polymorphism data, and it includes the complete rCRS reference sequence (accession NC_012920). Mitomap is the authoritative resource for variant interpretation in medical contexts and is increasingly used in forensic identification of degraded human remains, where a clinical mutation database provides context that a population-only database cannot. Mitomap does not, however, provide population-frequency tables formatted for forensic likelihood-ratio calculations.

EMPOP (the EDNAP mtDNA Population Database), curated at the Institute of Legal Medicine at Innsbruck Medical University, Austria, is built specifically for forensic mtDNA haplotype frequency estimation. EMPOP currently holds over 50,000 quality-controlled mitochondrial sequences from population studies across more than 130 populations worldwide. Its forensic QC pipeline is the most stringent of any mtDNA resource: submissions pass through the EMPOP quality-assurance pipeline, SAM2, which screens for phylogenetic inconsistencies, sequencing artefacts, and nomenclature errors against the PhyloTree human mtDNA phylogeny curated by van Oven and Kayser. Only sequences passing SAM2 are included in the database used for forensic frequency queries.

The ISFG DNA Commission's recommendations on mtDNA analysis, updated in 2014, state that population frequency estimates for forensic court reporting should be derived from EMPOP rather than ad hoc literature searches. The FBI Laboratory uses EMPOP for its US population mtDNA frequency database. The UK Forensic Science Regulator's approved mtDNA methodology references EMPOP as the standard frequency resource. German state forensic laboratories (under BKA coordination) use EMPOP with population-specific subsets for German and European populations.

mtDNA forensic interpretation pipeline: the evidentiary sequence is aligned to the rCRS, haplogroup is assigned via PhyloTree, and the full haplotype is queried against EMPOP population subsets to generate a match probability for court reporting. Each step must be documented in the case file.

YHRD: Y-STR Haplotype Population Database

A Y-STR haplotype shared by a suspect and evidence sample might be carried by one man in ten thousand in a global city or one in fifty in an isolated town, and the difference is entirely in which population dataset you query.

The Y-Chromosome Haplotype Reference Database (YHRD, coordinated at Humboldt University Berlin and accessible at yhrd.org) is the primary resource for Y-STR haplotype frequency estimation. As of 2024, YHRD holds over 300,000 male haplotypes from more than 1,400 population samples across more than 130 countries, making it the largest curated Y-STR resource in the world. Submissions are accepted through a standardised process requiring: (1) population designation at the country and metapopulation level; (2) haplotype data in the YHRD standard format (a defined panel of Y-STR loci including the minimal haplotype and, for more recent submissions, Yfiler Plus and PowerPlex Y23 panels); (3) documentation of informed consent and institutional ethics approval; (4) confirmation that the sample consists of unrelated males (YHRD excludes close relatives because they share identical Y-haplotypes by descent, which would artificially inflate haplotype frequencies).

The ISFG DNA Commission's 2014 recommendations on Y-STR analysis specify that Y-STR haplotype frequencies used in court reporting should be derived from YHRD using population samples appropriate to the case. The metapopulation groupings in YHRD (European, North African, Sub-Saharan African, East Asian, South Asian, Latin American, and others) provide the minimum granularity required for a defensible forensic frequency estimate. Country-specific subsets are available for the most common populations.

The quality-control mechanism that distinguishes YHRD from a simple literature aggregation is the "augmented" search, which applies a counting method (the modified ceiling method or the SWGDAM-recommended counting method) to produce a frequency estimate with a 95% confidence interval rather than a point estimate. For low-frequency haplotypes with zero observations in a given population, YHRD applies the count-based minimal estimator (1/N, where N is the number of haplotypes in the database for that population), which is the forensic convention endorsed by the ISFG and accepted in German, Dutch, and UK courts.
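The count-based estimators mentioned above are simple enough to write out. The sketch below shows the plain counting estimate with the 1/N minimal estimator for unobserved haplotypes, the augmented count (x + 1)/(N + 1) as it is commonly described, and the exact one-sided 95% upper bound for a haplotype with zero observations, 1 − 0.05^(1/N). The function names and the database size are illustrative assumptions, and this is not a reproduction of YHRD's own software.

```python
def counting_estimate(observed: int, n: int) -> float:
    """Plain count x/N, falling back to the 1/N minimal estimator when x is 0."""
    if n <= 0 or observed < 0 or observed > n:
        raise ValueError("need 0 <= observed <= n and n > 0")
    return (observed if observed > 0 else 1) / n


def augmented_count(observed: int, n: int) -> float:
    """Augmented count: treat the evidence haplotype as one extra database entry."""
    return (observed + 1) / (n + 1)


def zero_observation_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper confidence bound when the haplotype was never observed."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


N = 1000  # hypothetical number of haplotypes in the chosen population sample
print(counting_estimate(0, N))
print(augmented_count(0, N))
print(zero_observation_upper_bound(N))
```

The three figures differ for the same zero-observation result, and all of them shrink as N grows, which is why the report should name both the estimator and the size of the population sample queried.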

Key terms

E-value (BLAST): The expected number of alignments at least as good as the observed hit that would occur by chance in a database of the given size. An E-value of 1e-100 means that about 10^-100 such chance hits are expected per search, or one in 10^100 comparable searches. For forensic species identification, E-values below 1e-50 with identity above 98% are typically considered sufficient for a confident species call.
rCRS: The revised Cambridge Reference Sequence (GenBank accession NC_012920), the standard reference for human mitochondrial DNA. All mtDNA variant positions in forensic and clinical reports are numbered relative to the rCRS. It replaced the original Cambridge Reference Sequence (CRS) in 1999 after resequencing identified a small number of errors; the original numbering was preserved, and rare variants genuinely carried by the reference individual, such as 263A, were retained rather than changed.
EMPOP SAM2: The submission quality-assurance module used by the EMPOP database to screen incoming mtDNA sequences for phylogenetic inconsistencies, sequencing artefacts, and nomenclature errors. Only sequences passing SAM2 are included in the frequency-estimation dataset.
INSDC: International Nucleotide Sequence Database Collaboration, the three-way partnership between GenBank (NCBI/NLM, US), ENA (EMBL-EBI, UK/EU), and DDBJ (National Institute of Genetics, Japan) that synchronises public nucleotide sequence data globally.
Haplogroup: A monophyletic branch of the human mitochondrial (or Y-chromosome) phylogenetic tree defined by a set of derived mutations. Haplogroup assignment is the first step in placing an mtDNA or Y-STR haplotype within the known landscape of human genetic diversity.

Worked example

mtDNA BLAST Query in a Cross-Border Trafficking Case

A seized parcel of dried meat arrives at Indira Gandhi International Airport. The customs officer finds no CITES documentation. The Wildlife Institute of India's forensic genetics unit amplifies cytochrome-b and submits the 650-bp sequence to BLAST.

The top BLAST hit returns as Panthera tigris (tiger) at 99.5% identity, E-value 2e-318. The second hit is Panthera leo (lion) at 97.1% identity, E-value 3e-290. Both are Schedule I species under the Wildlife Protection Act, 1972. The examiner must now decide how to report this alignment and whether any placement below the species level can be supported.

Following SWGWILD reporting guidelines, the examiner documents the top five hits with percent identity and E-value. The gap between the first and second hits, 2.4 percentage points of identity across a 650-bp amplicon, corresponds to approximately 16 mismatches (0.024 × 650 ≈ 15.6): a biologically meaningful distance that clearly separates the two candidate species for this gene region. Because BOLD Systems (the Barcode of Life Data System) is built around the COI barcode rather than cytochrome-b, the examiner also sequences the COI region of the sample and queries BOLD as an independent cross-check, which returns an unambiguous tiger match at 99.8% with no near-misses from related species. The species call is reported as Panthera tigris with high confidence.

For placement below the species level, EMPOP, Mitomap, and rCRS coordinates cannot be used: all three are specific to human mtDNA, and SAM2 checks sequences against the human PhyloTree phylogeny. Assigning the sample to a South Asian tiger population would instead require a validated, species-specific tiger reference dataset, ideally one curated with the same kind of phylogenetic consistency checks that SAM2 applies to human submissions. A direct GenBank search applies no such check, so a population-of-origin statement should not rest on GenBank hits alone.

The final forensic report states: species identification as Panthera tigris (tiger) based on BLAST percent identity 99.5%, E-value 2e-318, with BOLD barcode cross-confirmation; no population-of-origin statement beyond what a validated tiger reference dataset supports; result presented as evidence to the Wildlife Crime Control Bureau under WPA 1972. In the UK, an equivalent case would be reported under the Control of Trade in Endangered Species (Enforcement) Regulations 2018. ENFSI Wildlife Working Group guidelines carry the same reporting standard for EU member states.

Practice

A forensic wildlife examiner queries BLAST with a 650-bp cytochrome-b sequence from seized bushmeat. The top hit is Panthera leo (African lion) at 99.2% identity, E-value 1e-320. The second hit is Panthera pardus (leopard) at 98.4% identity, E-value 1e-295. What is the most appropriate forensic conclusion?
