What happens between a sequencer's raw output and a reportable allele: read trimming, alignment to a reference (BWA, Bowtie2), variant calling (GATK, FreeBayes), forensic-specific tools (STRait Razor, FDSTools, MyFLq), coverage and balance filters, and the validation studies that turned MPS into an accredited forensic technique.
When the first massively parallel sequencing (MPS) instrument produced a FASTQ file in a forensic laboratory, the output looked nothing like an electropherogram. Instead of coloured peaks at expected allele positions, the analyst faced millions of short reads, each a 150 or 300-base text string followed by a quality score string. The path from that raw file to a reportable STR allele or a validated single-nucleotide variant runs through five sequential operations that together constitute the forensic MPS analysis pipeline.
The shift from capillary electrophoresis to MPS changes the nature of the forensic DNA examination in fundamental ways. A conventional CE-based STR kit reports a length-based allele designation (for example, D7S820 allele 10 means ten tetranucleotide repeats at that locus). MPS reports the complete sequence of the repeat and its flanking regions, enabling discrimination between isoalleles that appear identical on CE but differ by a single nucleotide within the repeat. This sequencing-based resolution increases the discrimination power of a forensic DNA profile and reduces the size of the population sharing any given haplotype, but it also requires a more complex analytical pipeline before a result can be reported.
By 2024, MPS-based forensic DNA analysis had been independently validated and accepted into accredited casework in laboratories in the United States (Verogen MiSeq FGx, validated by the FBI Laboratory in 2021 and by New York's Office of the Chief Medical Examiner in 2022), Germany (BKA, internal validation published 2020), the Netherlands (NFI, published 2018), and Australia (Victoria Police Forensic Services, published 2023). In India, the Central Forensic Science Laboratory (CFSL) in Hyderabad has published feasibility studies on MPS STR typing using Illumina platforms, and the National Institute of Biomedical Genomics (NIBMG) in Kalyani has contributed to validation of targeted MPS panels for forensic identification. This topic covers the complete analytical pipeline from FASTQ to reportable result.
Every decision a forensic analyst makes in the MPS pipeline rests on the accuracy of the base calls in the FASTQ file, and those calls come with a probabilistic quality score that many analysts read without understanding what it means.
A FASTQ file is the standard output format from Illumina, Ion Torrent, and most other sequencing platforms. Each sequenced read occupies four lines: a read identifier beginning with @, the nucleotide sequence, a separator line (+), and the quality score string. Quality scores are encoded in Phred format: a score of Q30 means a 1 in 1000 probability of a base-call error at that position (99.9% accuracy). Q20 means 1 in 100 (99% accuracy). Most forensic validation studies set a minimum read-quality threshold of Q30 for inclusion in allele calling.
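The Phred relationship described above is Q = -10 log10(P), where P is the probability that the base call is wrong. As a minimal sketch, assuming the Phred+33 ASCII encoding used by current Illumina output and a made-up four-line record (the read name, bases, and quality string are invented for illustration), the quality string can be decoded like this:

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred score Q to a base-call error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding (an assumption)."""
    return [ord(ch) - offset for ch in qual]


# Hypothetical FASTQ record: identifier, sequence, separator, quality string.
record = [
    "@read_001",
    "ACGTACGTAC",
    "+",
    "IIIIIIII5+",
]

scores = decode_quality_string(record[3])
for base, q in zip(record[1], scores):
    # Print each base with its Phred score and implied error probability.
    print(base, q, f"{phred_to_error_prob(q):.4f}")
```

In this encoding the character `I` decodes to Q40, `5` to Q20, and `+` to Q10, so the last two bases of the invented read would fall below a Q30 cut-off.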
Read trimming is the step that removes low-quality bases from read ends, adapter sequences from library preparation, and reads below a minimum length threshold before alignment. Trimmomatic (Bolger, Lohse and Usadel, Max Planck Institute of Molecular Plant Physiology and RWTH Aachen University, Germany) and fastp (Chen and colleagues, HaploX Biotechnology, Shenzhen, China) are the two most widely deployed trimming tools in forensic MPS workflows. Trimmomatic's SLIDINGWINDOW step scans each read from the 5' end and cuts it once the average quality within a window falls below a threshold; a four-base window with a Q20 threshold is a common setting. fastp additionally performs automatic adapter detection, a useful feature when adapter sequences vary between library preparation kits.
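The sliding-window idea can be illustrated with a short sketch. This is a simplified illustration in the spirit of a four-base, Q20 window, not a reimplementation of Trimmomatic or fastp; the function names, the minimum-length value, and the example read are assumptions made for the example.

```python
def sliding_window_trim(seq, quals, window=4, min_avg_q=20):
    """Cut the read at the start of the first window whose mean quality < min_avg_q."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg_q:
            # Keep only the bases before the failing window.
            return seq[:start], quals[:start]
    # No window failed (or the read is shorter than one window): keep it whole.
    return seq, quals


def passes_min_length(seq, min_len=50):
    """Discard reads that are too short after trimming (min_len is illustrative)."""
    return len(seq) >= min_len


# Hypothetical read whose quality decays toward the 3' end.
seq = "ACGTACGTAC"
quals = [38, 38, 37, 36, 35, 30, 22, 12, 8, 5]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq, trimmed_quals, passes_min_length(trimmed_seq, min_len=5))
```

Here the first window with a mean below Q20 starts at the sixth base, so the sketch keeps the first five bases. Whatever window size and threshold a laboratory chooses would need to be the values recorded in its validation.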
Trim-before-align is not a universal standard. Some forensic pipelines using the Verogen ForenSeq Universal Analysis Software (UAS) perform adapter trimming internally before alignment, and the BKA-validated workflow for the ForenSeq DNA Signature Prep kit handles trimming within the UAS pipeline rather than as a separate step. The choice of trimming tool and threshold must be documented in the method validation report, as different thresholds affect sensitivity (the ability to detect low-level alleles in mixtures) and specificity (the ability to call the correct allele in the presence of PCR artefacts).
Alignment to a reference genome sounds like a solved problem until the read contains an STR repeat region, at which point the short-read aligner's gap-opening penalty becomes the most consequential parameter in the pipeline.
Read alignment maps each trimmed FASTQ read to its position on a reference genome or a targeted reference panel. For whole-genome forensic sequencing (used in some forensic genetic genealogy workflows), the reference is the GRCh38 human genome assembly. For targeted amplicon-based forensic STR panels (ForenSeq DNA Signature Prep, Precision ID GlobalFiler, PowerSeq 46GY System), the reference consists of amplicon sequences covering only the targeted loci.
BWA-MEM (Burrows-Wheeler Aligner, developed at the Wellcome Sanger Institute, UK) is the standard short-read aligner for whole-genome forensic pipelines. It uses the Burrows-Wheeler transform to index the reference and a Smith-Waterman extension algorithm for local realignment around gaps. BWA-MEM handles reads from 70 bp to a few thousand base pairs and is the recommended aligner in the GATK best-practices pipeline. Bowtie2 (Langmead and Salzberg, Johns Hopkins University, US) uses a different seed-and-extend strategy optimised for speed on short reads (up to ~500 bp) and is the aligner embedded in the Verogen UAS internal pipeline for ForenSeq STR typing.
For STR-containing regions, both aligners may produce misalignments around the repeat because short reads that span only a portion of a long repeat have multiple equally valid mapping positions. The forensic solution is to use targeted, amplicon-based sequencing where each PCR primer pair flanks the repeat and the read spans the entire allele, leaving unambiguous anchoring sequences on both sides of the repeat. This design is built into the ForenSeq, Precision ID, and PowerSeq library preparation kits and is the reason targeted amplicon sequencing dominates forensic MPS rather than whole-genome shotgun sequencing for STR-typing applications.
Output from BWA or Bowtie2 is a SAM (Sequence Alignment/Map) file, which samtools converts to the compressed BAM format and then sorts and indexes for downstream processing. samtools flagstat summarises how many reads mapped, while samtools depth, samtools coverage, and qualimap provide the coverage statistics (mean read depth, percentage of target bases with depth above 30x, uniformity across loci) that serve as the input quality metrics for the subsequent filtering steps.
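To make the three coverage metrics concrete, the sketch below computes them from per-base depth values. It assumes the depths have already been exported from the BAM file (for example as a per-position depth table) and loaded into a dictionary; the locus depths are invented, and defining uniformity as the lowest locus mean divided by the highest is only one possible convention.

```python
from statistics import mean


def coverage_summary(depths_by_locus, min_depth=30):
    """Summarise mean depth, fraction of bases at or above min_depth, and uniformity."""
    locus_means = {locus: mean(d) for locus, d in depths_by_locus.items()}
    all_depths = [d for depths in depths_by_locus.values() for d in depths]
    return {
        "mean_depth": mean(all_depths),
        "fraction_at_min_depth": sum(d >= min_depth for d in all_depths) / len(all_depths),
        # Assumed definition: ratio of the lowest to the highest per-locus mean depth.
        "uniformity": min(locus_means.values()) / max(locus_means.values()),
        "locus_means": locus_means,
    }


# Hypothetical per-base depths for two loci.
depths = {
    "D7S820": [100, 120, 110],
    "CSF1PO": [20, 25, 30],
}
print(coverage_summary(depths))
```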
Variant callers designed for clinical genomics carry assumptions about ploidy and error models that hold for single-source germline samples but break down in forensic mixtures, and the analyst who does not understand those assumptions will misinterpret the output.
Variant calling is the step that identifies positions where reads differ from the reference, assigns an allele identity to each position, and generates a VCF (Variant Call Format) file recording each variant with its position, reference allele, alternative allele, quality score, and genotype likelihood. GATK (Genome Analysis Toolkit, Broad Institute, Cambridge, Massachusetts, US) and FreeBayes (Erik Garrison and Gabor Marth, Boston College, US) are the two dominant open-source variant callers, and both are used in forensic MPS workflows.
GATK HaplotypeCaller performs local de novo assembly of haplotypes within each active region of the genome, then genotypes alleles using a Bayesian likelihood model. The output includes genotype quality (GQ) scores and phased haplotypes. GATK was designed for clinical germline and somatic variant calling in diploid organisms and performs best when the expected ploidy and error model are set correctly. For forensic single-source samples, GATK's default diploid model is appropriate and produces highly accurate SNP and short-indel calls. For forensic mixtures, GATK's diploid assumption breaks down, because mixture components contribute at sub-diploid fractions, and the genotype likelihood model assigns low confidence to variants present at 10-30% allele fraction.
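Why a diploid model struggles with minor-contributor alleles can be seen with a toy binomial calculation. This is a deliberately simplified sketch, not GATK's actual likelihood model (which works on read-to-haplotype likelihoods); the per-read error rate of 1% and the read counts are assumptions chosen for illustration.

```python
from math import comb, log10


def diploid_log10_likelihoods(alt_reads, total_reads, error_rate=0.01):
    """Binomial log10 likelihood of the read counts under each diploid genotype."""
    # Expected fraction of alt-supporting reads for each genotype (toy model).
    expected_alt_fraction = {
        "hom_ref": error_rate,
        "het": 0.5,
        "hom_alt": 1.0 - error_rate,
    }
    ref_reads = total_reads - alt_reads
    log_coeff = log10(comb(total_reads, alt_reads))
    return {
        genotype: log_coeff + alt_reads * log10(f) + ref_reads * log10(1.0 - f)
        for genotype, f in expected_alt_fraction.items()
    }


# A balanced single-source heterozygote versus a variant at 15% allele fraction.
for alt, total in [(50, 100), (15, 100)]:
    ll = diploid_log10_likelihoods(alt, total)
    print(alt, total, {g: round(v, 2) for g, v in ll.items()})
```

With 50 of 100 reads supporting the variant, the heterozygous genotype is favoured by a wide margin. With 15 of 100, the heterozygous and homozygous-reference likelihoods are nearly equal under this toy model, which is the kind of ambiguity that translates into a low genotype quality.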
FreeBayes is not tied to a fixed diploid model: its ploidy setting is configurable, and it treats all variants above a configurable allele-frequency threshold as candidates. This makes it more appropriate for forensic mixture analysis than GATK in its default configuration, and the Netherlands Forensic Institute's published MPS mixture workflow uses FreeBayes as the variant caller within their forensic pipeline. Configuring FreeBayes for forensic use typically requires setting the minimum alternate fraction (--min-alternate-fraction) to 0.05-0.10 and the minimum alternate count (--min-alternate-count) to 3-5 reads to suppress PCR error artefacts while retaining minor-contributor alleles.
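The effect of the two thresholds can be sketched as a simple candidate filter. This illustrates the threshold logic only and is not FreeBayes code; the function name and the example read counts are hypothetical, and the defaults are taken from the lower end of the ranges quoted above.

```python
def passes_candidate_filters(alt_reads, total_reads,
                             min_alt_fraction=0.05, min_alt_count=3):
    """Keep a candidate only if it meets both the count and the fraction threshold."""
    if total_reads <= 0:
        return False
    return (alt_reads >= min_alt_count
            and alt_reads / total_reads >= min_alt_fraction)


# Hypothetical candidates as (alt-supporting reads, total reads at the position).
candidates = {
    "minor_contributor": (12, 100),   # 12% fraction, 12 reads
    "too_few_reads": (2, 20),         # 10% fraction, but only 2 reads
    "likely_pcr_error": (4, 400),     # 4 reads, but only 1% fraction
}
for name, (alt, total) in candidates.items():
    print(name, passes_candidate_filters(alt, total))
```

Requiring both conditions matters: the count threshold guards against chance observations at low depth, while the fraction threshold guards against accumulated error reads at high depth.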
For STR allele calling specifically, neither GATK nor FreeBayes is optimised for the repeat-counting logic that forensic STR interpretation requires. This gap is addressed by forensic-specific tools discussed in the next section.
The generic bioinformatics pipeline delivers a variant table, but converting a repeat-region sequence into a CODIS-compatible allele designation is a forensic-specific operation that requires forensic-specific software.
Three software packages were developed specifically for forensic STR allele calling from MPS data and have been independently validated in peer-reviewed studies:
STRait Razor (STR Allele Identification Tool - Razor, University of North Texas Health Science Center, US) extracts STR reads from FASTQ data by matching locus-specific flanking sequences, counts the repeat units, and assigns a CODIS-compatible allele designation (numeric) or a sequence-based allele designation (the full repeat sequence). STRait Razor was validated in the SWGDAM-approved study by Just et al. (2015) using 87 samples on the Illumina MiSeq, with concordance to CE-based typing demonstrated for all 20 CODIS loci. Version 3.0 (2019) added support for extended CODIS 20 loci and sequence-based allele reporting, enabling the isoallele discrimination that is the primary advantage of MPS over CE.
FDSTools (Forensic DNA Sequencing Tools, developed at the Netherlands Forensic Institute) is a more comprehensive toolkit that handles STR allele calling, noise filtering using a background noise model, and CE-to-MPS concordance analysis. FDSTools uses a stutter model trained on laboratory-specific data to distinguish true minor alleles from PCR-stutter artefacts, which is the most critical analytical step in MPS mixture interpretation. The NFI published a full validation of FDSTools for Illumina MiSeq-based STR typing in 2016, and the tool is used in operational casework at NFI.
MyFLq (My Forensic Loci Query, Ghent University, Belgium) provides a web-based and command-line interface for STR allele calling with built-in allele-frequency databases for STR reporting. It was validated using the ForenSeq DNA Signature Prep kit on the Illumina MiSeq and published by Van Neste et al. (2012). The tool's strength is its integration with allele frequency databases for forensic statistics, enabling seamless calculation of match probabilities after allele calling.
| Tool | Developer | Primary function | Validated kit | Key publication |
|---|---|---|---|---|
| STRait Razor v3 | Univ. of North Texas Health Science Center (US) | STR allele calling + sequence designation | ForenSeq, GlobalFiler MPS | Just et al. 2015, 2019 |
| FDSTools | Netherlands Forensic Institute | STR calling + stutter modelling + noise filter | Illumina MiSeq amplicons | Hoogenboom et al. 2016 |
| MyFLq | Ghent University (Belgium) | Web-based STR calling + stats integration | ForenSeq DNA Sig. Prep | Van Neste et al. 2012 |
| ForenSeq UAS | Verogen (US) | End-to-end pipeline for ForenSeq kit | ForenSeq DNA Sig. Prep | Verogen v.study 2021 |
| Precision ID Reporter | Thermo Fisher (US) | Allele calls for Precision ID kits | Precision ID GlobalFiler | Mulero et al. 2020 |
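The repeat-counting step shared by these tools can be illustrated with a minimal sketch: locate the flanking anchors in a read, take the sequence between them, and derive both a length-based and a sequence-based designation. The flank sequences, motif length, and reads below are invented for illustration, and real loci have compound repeat structures and nomenclature rules that this sketch ignores.

```python
def call_str_allele(read, flank5, flank3, motif_len=4):
    """Return (length-based designation, repeat-region sequence), or None if unanchored."""
    start = read.find(flank5)
    if start == -1:
        return None
    start += len(flank5)
    end = read.find(flank3, start)
    if end == -1:
        return None
    region = read[start:end]
    # Full repeats plus leftover bases, written in the CE style (e.g. "9.3").
    full, partial = divmod(len(region), motif_len)
    designation = str(full) if partial == 0 else f"{full}.{partial}"
    return designation, region


# Hypothetical flanks and two reads that share a length but differ in sequence.
FLANK5, FLANK3 = "GGTCA", "CCATG"
read_a = "TT" + FLANK5 + "GATA" * 10 + FLANK3 + "AA"
read_b = "TT" + FLANK5 + "GATA" * 4 + "GACA" + "GATA" * 5 + FLANK3 + "AA"

for read in (read_a, read_b):
    print(call_str_allele(read, FLANK5, FLANK3))
```

Both invented reads receive the length-based designation 10, but their repeat-region sequences differ by one base. That is the isoallele case that CE cannot resolve and sequence-based reporting can.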
The difference between a research-grade MPS pipeline and a forensic-grade one is a set of quantitative thresholds applied before an allele call is reported, and those thresholds must be justified by the laboratory's own validation data, not by the manufacturer's insert.
Forensic MPS pipelines apply two categories of quality filter before an allele is reported: coverage thresholds and balance thresholds.
Coverage thresholds define the minimum number of reads that must map to a locus for the result to be reported. The ForenSeq UAS validation study (Verogen, 2021) sets the minimum per-locus read depth at 30 reads (reads supporting a given allele), with a recommended average depth of 150x for full-profile calls. The NFI validation for FDSTools uses a minimum of 50 reads per allele. The BKA validation for their in-house MPS pipeline uses a minimum of 30 reads with an additional requirement that at least two independent PCR replicates agree on the allele call for low-coverage loci. These numbers reflect each laboratory's empirically determined point at which allele calls become unreliable due to stochastic sampling effects at the bottom of the coverage distribution.
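A coverage filter of this kind reduces to a comparison against the validated minimum, optionally combined with a replicate-agreement check for low-coverage loci. The sketch below is illustrative only: the function names, the default of 30 reads, and the read counts are assumptions, and each laboratory's own validated values would replace them.

```python
def apply_coverage_threshold(allele_reads, min_reads=30):
    """Split observed alleles into reportable and flagged sets by read count."""
    reportable = {a: n for a, n in allele_reads.items() if n >= min_reads}
    flagged = {a: n for a, n in allele_reads.items() if n < min_reads}
    return reportable, flagged


def replicates_agree(rep1_alleles, rep2_alleles):
    """True when two independent PCR replicates yield the same set of alleles."""
    return set(rep1_alleles) == set(rep2_alleles)


# Hypothetical read counts per allele at one locus.
locus_reads = {"10": 412, "12": 22}
reportable, flagged = apply_coverage_threshold(locus_reads, min_reads=30)
print("reportable:", reportable)
print("flagged for review:", flagged)
print("replicates agree:", replicates_agree(["10", "12"], ["10", "12"]))
```

An allele that falls below the threshold is not reported as called. It is flagged so that the analyst can re-amplify, re-sequence, or report the locus as inconclusive according to the laboratory's procedure.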
Balance thresholds define the minimum ratio of read counts between alleles at a heterozygous locus (intralocus balance) or between loci in a profile (interlocus balance). For a heterozygous STR locus, a heterozygote balance of at least 0.60 (the minor allele's read count is at least 60% of the major allele's read count) is a common threshold for reporting a genuine heterozygous call rather than a homozygous call with background noise. Below that ratio, the locus may be flagged for review and the profile reported with a note. In mixture interpretation, the balance between contributors' alleles reflects their DNA input ratio, and the forensic-specific mixture software (probabilistic genotyping tools like STRmix and EuroForMix, now adapted for MPS input from FDSTools output) uses this balance information as the primary data for mixture deconvolution.
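Both balance measures are simple ratios of read counts. The sketch below computes the heterozygote balance as the minor allele's reads divided by the major allele's reads, and the interlocus balance as each locus's share of the total reads in the profile. The 0.60 cut-off follows the common threshold mentioned above; the function names and read counts are assumptions for illustration.

```python
def heterozygote_balance(reads_a, reads_b):
    """Minor allele read count divided by major allele read count (0 to 1)."""
    major = max(reads_a, reads_b)
    if major <= 0:
        raise ValueError("at least one allele must have reads")
    return min(reads_a, reads_b) / major


def interlocus_balance(reads_by_locus):
    """Share of the profile's total reads contributed by each locus."""
    total = sum(reads_by_locus.values())
    if total <= 0:
        raise ValueError("profile has no reads")
    return {locus: n / total for locus, n in reads_by_locus.items()}


MIN_HET_BALANCE = 0.60  # illustrative threshold; laboratories validate their own

# Hypothetical allele read counts at two heterozygous loci.
for locus, (a, b) in {"D7S820": (480, 520), "CSF1PO": (300, 700)}.items():
    hb = heterozygote_balance(a, b)
    status = "ok" if hb >= MIN_HET_BALANCE else "flag for review"
    print(locus, round(hb, 3), status)

print(interlocus_balance({"D7S820": 1000, "CSF1PO": 1000, "TH01": 2000}))
```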
The SWGDAM 2020 guidelines on validation requirements for MPS in forensic DNA typing specify that validation studies must report: (1) concordance with CE typing for the same samples; (2) sensitivity (minimum input DNA at which a full profile is reliably obtained); (3) mixture studies demonstrating accurate allele calls at defined contributor ratios; (4) stochastic effect studies demonstrating the analytical thresholds below which drop-out is possible; and (5) reproducibility studies demonstrating consistent results across instruments, reagent lots, and analysts. These requirements mirror the ISO/IEC 17025 validation framework and the ENFSI DNA Working Group guidelines that European forensic laboratories follow.
A published validation study is the passport that takes an MPS workflow from a research paper to an accredited forensic result, and three independent validation components together make the passport valid.
The Verogen ForenSeq DNA Signature Prep kit on the MiSeq FGx is the most extensively published MPS system in forensic use. Initial validation by Zeng et al. (2015, Investigative Genetics) demonstrated full-profile concordance with CE at input DNA down to 250 pg. Subsequent validation by the FBI Laboratory (published 2021 in Forensic Science International: Genetics) across multiple analysts and instrument runs confirmed concordance at greater than 99.8% of called alleles. The New York OCME validation (2022) added mixture performance data, demonstrating accurate allele calls in two-person mixtures at contributor ratios from 1:4 to 4:1.
The Thermo Fisher Precision ID GlobalFiler NGS STR kit on the Ion S5 platform was validated by the Royal Canadian Mounted Police (RCMP) Laboratory (2019, FSI: Genetics) and by Queensland Health Forensic and Scientific Services, Australia (2020). The RCMP study demonstrated concordance across 500 samples and showed that sequence-based allele designation increased the discrimination power of the profile by approximately one order of magnitude compared to length-based CE typing, because sequence variants within STR repeats are resolved.
Accreditation of MPS-based forensic DNA typing under ISO/IEC 17025 has been achieved by: Verogen-certified laboratories in the US (OCME New York, Virginia DFS), NFI (Netherlands, accredited under the Dutch accreditation board RvA), the BKA (Germany, accredited under DAkkS), and Victoria Police (Australia, accredited under NATA). In each case, the accreditation assessment included review of the validation study, an internal audit of the analytical pipeline, a proficiency test, and a blind sample test using reference material from NIST's Standard Reference Material 2372a (human DNA quantitation standard).
A forensic MPS pipeline using BWA-MEM and GATK HaplotypeCaller produces a VCF file from a single-source reference sample. Coverage at locus CSF1PO is 22 reads. The laboratory's validated minimum coverage threshold is 30 reads. What is the correct analytical action?