Chromosomes, Genes and the Human Genome
The human genome contains roughly three billion base pairs of DNA, organised into 23 pairs of chromosomes and encoding approximately 20,000 protein-coding genes alongside vast non-coding regions. Forensic DNA profiling exploits specific variable loci within this genome to generate profiles that can link biological evidence to individuals with high statistical confidence.
DNA in a human cell does not float freely in the nucleus as a single strand. It is compacted into chromosomes: discrete structures in which a long DNA molecule is wrapped around spool-like histone proteins, coiled further, and eventually condensed into the rod-shaped forms visible under a light microscope during cell division. The human genome is divided across 23 pairs of chromosomes, giving each cell 46 chromosomes in total, with roughly 3.2 billion base pairs of DNA in each haploid set. Within that sequence sit approximately 20,000 protein-coding genes, whose exons together make up only about 1.5 percent of the genome. The remaining sequence includes regulatory elements, non-coding RNA genes, repetitive elements, introns, and regions whose function is still being characterised. It is within the non-coding, repetitive regions that forensic scientists find the variable loci used for DNA profiling.
Understanding the genome's architecture matters for forensic practice for two reasons. First, it explains why forensic loci are chosen from non-coding regions: profiling in these regions avoids revealing medical information about a contributor, reducing privacy concerns and supporting the admissibility of profiles in courts from India to the United States to the European Union. Second, the scale and structure of the genome determine how individual variation arises and how rare a specific combination of alleles will be in a given population. A forensic scientist who can explain why certain loci were chosen, what the genome looks like around those loci, and how population genetics predicts the frequency of a profile is a scientist who can give reliable and defensible evidence.
Forensic biology entered the genomic era in 1984 when Alec Jeffreys at the University of Leicester developed DNA fingerprinting using variable number tandem repeats (VNTRs). Modern profiling uses short tandem repeats (STRs), which are shorter, more abundant, and can be amplified from degraded or trace-quantity samples using PCR. The Combined DNA Index System (CODIS) in the United States standardised on 20 core STR loci; the UK National DNA Database and the European Standard Set of loci adopt overlapping but distinct panels. India's national DNA database framework, under the DNA Technology (Use and Application) Regulation Bill, proposes a panel consistent with international standards. All these systems rest on the same genomic foundation: variable, non-coding, highly polymorphic loci spread across multiple chromosomes.
By the end of this topic you will be able to:
- Describe how DNA is packaged from the double helix into nucleosomes, chromatin fibres, and ultimately chromosomes.
- Distinguish coding genes from non-coding regions and explain the functional importance of each in the context of the genome.
- State the approximate size of the human genome in base pairs and the number of protein-coding genes, and explain why non-coding regions dominate.
- Define genetic polymorphism and explain why STR loci in non-coding regions are the preferred targets for forensic DNA profiling.
- Explain the role of the sex chromosomes in determining biological sex and in forensic applications such as Y-STR profiling.
- Chromosome: A discrete, condensed structure consisting of a single long DNA molecule complexed with histone proteins. Humans normally carry 46 chromosomes in 23 homologous pairs in somatic cells; gametes (eggs and sperm) carry 23 unpaired chromosomes.
- Gene: A segment of DNA that encodes a functional product, most commonly a protein but also non-coding RNA molecules. The exons of protein-coding genes occupy roughly 1.5 percent of the human genome sequence; the rest is non-coding.
- Locus (plural: loci): A defined physical position on a chromosome. In forensic genetics, 'locus' usually refers to a specific STR site used in profiling. The different sequence variants found at a locus across a population are called alleles.
- Short tandem repeat (STR): A DNA sequence consisting of a short motif (typically 2 to 6 base pairs) repeated in tandem multiple times. The number of repeats varies between individuals, making STR loci highly polymorphic and ideal for forensic discrimination.
- Polymorphism: The existence of two or more sequence variants (alleles) at a locus in a population at a frequency above one percent. High polymorphism at a locus means that different individuals are likely to carry different alleles, which is what makes a locus useful for individualisation.
- Heterozygosity: The proportion of individuals in a population who carry two different alleles at a locus. A locus with high heterozygosity is more informative for forensic profiling than one where most people carry the same allele. CODIS loci were selected partly for their high heterozygosity values.
DNA packaging: from double helix to chromosome
If the DNA in a single human cell were stretched out, it would measure approximately two metres in length. That molecule must fit inside a nucleus roughly six micrometres across. The cell achieves this through a hierarchy of compaction, and understanding that hierarchy explains both why chromosomes exist as discrete structures and why forensic samples can contain intact chromosomal DNA even from dried biological material.
The first level of compaction is the nucleosome. A section of about 147 base pairs of DNA wraps roughly 1.65 times around a core of eight histone proteins (two each of H2A, H2B, H3, and H4). The result looks like beads on a string when viewed by electron microscopy. This structure reduces the effective length of the DNA by about seven-fold. A linker histone (H1) stabilises the connection between adjacent nucleosomes. At the next level, nucleosomes are coiled into a 30-nanometre chromatin fibre, achieving a further six-fold compaction. Further looping and scaffolding compress the fibre into the classic X-shaped chromosome visible during mitosis, where the total compaction factor reaches approximately 10,000-fold relative to the naked DNA.
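The figures in this section can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a rise of about 0.34 nm per base pair for double-stranded DNA (a standard textbook value, not stated above) and the diploid genome size of about 6.4 billion base pairs quoted in this topic. The function names are hypothetical.

```python
# Assumption: B-form DNA rises about 0.34 nm per base pair (not stated in the text).
RISE_PER_BP_NM = 0.34
DIPLOID_BP = 6.4e9  # diploid genome size quoted in this topic


def dna_length_metres(base_pairs, rise_nm=RISE_PER_BP_NM):
    """Contour length of naked double-stranded DNA, in metres."""
    return base_pairs * rise_nm * 1e-9


def cumulative_compaction(*fold_factors):
    """Multiply successive fold-compaction factors into one overall factor."""
    total = 1.0
    for factor in fold_factors:
        total *= factor
    return total


length_m = dna_length_metres(DIPLOID_BP)
early = cumulative_compaction(7, 6)  # nucleosome level, then the 30 nm fibre
print(f"Naked DNA per diploid cell: {length_m:.2f} m")
print(f"Compaction after the 30 nm fibre: about {early:.0f}-fold")
print(f"Further compaction needed to reach 10,000-fold: about {10_000 / early:.0f}-fold")
```

The result agrees with the "approximately two metres" figure, and it shows that nucleosomes and the 30-nanometre fibre together account for only a small part of the total compaction; most of it comes from the higher-order looping and scaffolding.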
This compaction is dynamic. During interphase (the non-dividing state of a cell), chromosomes are partially decondensed and genes are accessible to the transcription machinery. Regions that are being actively transcribed tend to be in more open (euchromatic) configurations; silenced regions are in tighter (heterochromatic) configurations. For forensic biology, the important point is that most cells in biological evidence, including dried bloodstains, epithelial cells in saliva, and spermatozoa, contain intact chromosomes in their nuclei, and the DNA within those chromosomes can be extracted and analysed by standard laboratory methods.
The human genome: scale, structure, and gene content
The Human Genome Project, completed in draft in 2001 and in finished sequence by 2003, established that the haploid human genome contains approximately 3.2 billion base pairs of DNA. The diploid genome in most somatic cells therefore contains about 6.4 billion base pairs. The chromosomes range in size from chromosome 1 (about 249 million base pairs) down to chromosome 21 (about 47 million base pairs). The Y chromosome, at about 57 million base pairs, is also among the smallest, and much of it is repetitive heterochromatin.
Protein-coding genes number approximately 20,000, and their exons (the portions that are translated into protein) together account for roughly 1.5 percent of the genome sequence. The rest is composed of introns (non-coding intervening sequences within genes), regulatory elements (promoters, enhancers, silencers), non-coding RNA genes (including microRNAs and long non-coding RNAs), repetitive sequences, transposable elements, and sequences with no currently assigned function. This non-coding majority is not 'junk': much of it is transcribed, some of it regulates gene expression, and it is within this fraction that the highly polymorphic STR loci used in forensic profiling are found.
| Chromosome | Approximate size (Mbp) | Approximate gene count | Notable forensic feature |
|---|---|---|---|
| 1 | 249 | 2,000 | Largest autosome; many STR loci |
| 7 | 159 | 1,150 | Contains D7S820, a core CODIS locus |
| 13 | 114 | 320 | Contains D13S317, a core CODIS locus |
| X | 155 | 800 | X-STR profiling; sex determination |
| Y | 57 | 70 | Y-STR paternal lineage tracing; male sex determination |
The 22 autosomes are numbered roughly by size (chromosome 1 being the largest), and each individual carries two copies, one inherited from each parent. The 23rd pair is the sex chromosomes: females carry two X chromosomes and males carry one X and one Y. This distinction has direct forensic relevance. Autosomal STR profiling uses markers on the 22 autosomes and can identify individuals regardless of sex. Y-chromosome STR profiling traces the paternal line. X-chromosome STR profiling follows inheritance rules intermediate between autosomal and Y-linked, and is particularly useful in kinship analysis.
Coding genes and non-coding regions: a functional map
A protein-coding gene has a defined structure. Moving along the DNA in the direction of transcription, a gene contains a promoter region (where transcription factors and RNA polymerase bind), a 5' untranslated region, one or more exons (coding sequences), introns separating the exons, a 3' untranslated region, and a polyadenylation signal. When the gene is expressed, the stretch from the transcription start site to just beyond the polyadenylation signal is transcribed as pre-mRNA. A 5' cap and a poly-A tail are then added to the transcript, the introns are spliced out to produce mature mRNA, and the mRNA is exported from the nucleus and translated into protein.
Introns are non-coding sequences within genes, but they are not unimportant. Some introns contain regulatory elements, and alternative splicing, in which different combinations of exons are joined, can produce multiple protein variants from a single gene. The intronic fraction of the genome is larger than the exonic fraction, and many STR loci are located in intronic sequences. An STR locus sitting inside an intron is removed from the transcript during splicing, so it is not part of the mature mRNA, and variation in the number of repeats at the loci chosen for forensic panels is not expected to alter the protein product. This is why such loci are attractive for forensic profiling.
Beyond introns, the genome contains large intergenic regions, which are the stretches between genes. These regions include regulatory sequences that control gene expression over long distances, non-coding RNA genes, the centromeres (repetitive regions at the constriction point of each chromosome that are critical for cell division), and the telomeres (protective caps at chromosome ends). Repetitive elements, including satellite DNA, minisatellites, and microsatellites (another name for STRs), are distributed throughout. The density of STR loci is roughly one every few thousand base pairs, giving the genome on the order of a million potential STR sites, of which a carefully chosen subset is used in forensic panels.
Genetic variation and why it arises
No two people (other than monozygotic, or identical, twins) have the same genomic sequence. The differences arise from several mechanisms. Single nucleotide polymorphisms (SNPs) are positions where a single base differs between individuals; there are roughly 4 to 5 million SNPs distinguishing any two unrelated people. Insertions and deletions (indels) occur when short stretches of sequence are present in some individuals but absent in others. Copy number variants (CNVs) are larger regions, sometimes encompassing whole genes, that are duplicated or deleted in some people. STRs vary in the number of times a short motif is repeated: a locus with the motif AGAT repeated 7 times in one person and 12 times in another illustrates allelic variation at that locus.
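To make the AGAT example concrete, the sketch below counts the longest uninterrupted run of a motif in a sequence. It is a simplified illustration, not a description of laboratory allele calling, which typically relies on amplicon sizing or sequencing with dedicated software. The flanking sequences and the function name `longest_tandem_run` are invented for this example.

```python
import re


def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    motif = motif.upper()
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


# Hypothetical flanking sequences, invented for illustration only.
person_a = "GGTCA" + "AGAT" * 7 + "TCCGA"
person_b = "GGTCA" + "AGAT" * 12 + "TCCGA"

print(longest_tandem_run(person_a, "AGAT"))  # allele with 7 repeats
print(longest_tandem_run(person_b, "AGAT"))  # allele with 12 repeats
```

By convention the allele is named after the repeat count, so these two people would carry alleles 7 and 12 at this hypothetical locus.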
STR variation arises primarily through replication slippage. During DNA replication, the nascent strand can temporarily dissociate and re-anneal out of register on the template strand at a repeat region. This produces a loop in either the template or the nascent strand, resulting in the insertion or deletion of one or more repeat units in the daughter molecule. Because this happens with measurable frequency at STR loci, the number of repeats at a given STR site drifts over generations, generating the allelic diversity that makes these loci polymorphic.
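The drift described here can be sketched as a toy simulation. The model below is an assumption made for illustration: each transmission has a fixed chance of slippage, and each slippage event adds or removes exactly one repeat unit with equal probability. Real loci differ in rate and in the balance between gains and losses. The rate of 1 in 500 is taken from the range quoted in this topic, and the function names are hypothetical.

```python
import random


def transmit(repeats, mutation_rate, rng):
    """One parent-to-child transmission under a single-step slippage model."""
    if rng.random() < mutation_rate:
        step = rng.choice((-1, 1))     # gain or lose one repeat unit
        return max(1, repeats + step)  # never drop below one repeat
    return repeats


def simulate_lineage(start_repeats, generations, mutation_rate, seed=0):
    """Follow one allele down a lineage; return the repeat count at each generation."""
    rng = random.Random(seed)
    history = [start_repeats]
    for _ in range(generations):
        history.append(transmit(history[-1], mutation_rate, rng))
    return history


history = simulate_lineage(start_repeats=10, generations=2000, mutation_rate=1 / 500)
changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
print(f"Final allele: {history[-1]} repeats after {changes} slippage events")
```

Over a handful of generations the allele almost always passes unchanged, which is why parent-child comparisons work; over thousands of generations across a whole population, the same process produces the spread of alleles seen at each locus.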
Population genetics determines how useful a given locus is. If an allele is very common, finding it in crime-scene evidence and in a suspect tells us little, because many people in the population carry it. If an allele is rare, finding it in both samples is more informative. Forensic panels combine many independent, highly polymorphic loci so that the product of the per-locus genotype frequencies is as small as possible. The resulting random match probability can be extremely small, often less than one in a billion for a full 20-locus CODIS profile.
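A minimal sketch of how per-locus frequencies combine is shown below. It assumes Hardy-Weinberg proportions (genotype frequency p² for a homozygote and 2pq for a heterozygote) and independence between loci, neither of which is spelled out above, and it omits the population-substructure corrections that casework calculations typically apply. The allele frequencies are invented for illustration and are not taken from any population database.

```python
def genotype_frequency(p, q=None):
    """Hardy-Weinberg genotype frequency: p*p if homozygous, 2*p*q if heterozygous."""
    return p * p if q is None else 2 * p * q


def random_match_probability(profile):
    """Multiply per-locus genotype frequencies (product rule, independent loci)."""
    rmp = 1.0
    for alleles in profile.values():
        rmp *= genotype_frequency(*alleles)
    return rmp


# Hypothetical allele frequencies, invented for illustration only.
profile = {
    "D7S820": (0.20, 0.15),  # heterozygote
    "D13S317": (0.30,),      # homozygote
    "TH01": (0.25, 0.10),    # heterozygote
}

rmp = random_match_probability(profile)
print(f"Random match probability: about 1 in {1 / rmp:,.0f}")
```

Three loci already give a figure in the thousands; each additional independent locus multiplies in another small fraction, which is how a 20-locus profile reaches probabilities far below one in a billion.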
Selecting loci for forensic profiling: the criteria
Not every polymorphic locus in the genome is suitable for forensic profiling. Selection involves balancing several technical and ethical criteria. The Scientific Working Group for DNA Analysis Methods (SWGDAM) in the United States, the European DNA Profiling Group (EDNAP), and equivalent bodies in other jurisdictions developed the criteria that underpin current panels.
- High polymorphism: the locus must have many alleles in the population, with no single allele dominating. This is measured by heterozygosity. CODIS loci typically have heterozygosity values above 0.7.
- Non-coding location: the locus must not encode medical, phenotypic, or other sensitive personal information. Placing profiling loci in introns or intergenic regions achieves this.
- PCR amplifiability from degraded samples: the target region must be short enough to survive degradation and be amplified reliably. Most forensic STR amplicons are under 400 base pairs; newer panels use even shorter amplicons for degraded samples.
- Low mutation rate: the locus must be stable enough that parent-child allele sharing is reliable for kinship analysis, typically a mutation rate below 1 in 500 transmissions.
- Autosomal independence: loci on different chromosomes (or far apart on the same chromosome) assort independently at meiosis, so allele combinations can be treated as statistically independent, allowing the multiplication rule in profile frequency calculations.
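The heterozygosity criterion in the list above can be illustrated with a short calculation. The sketch uses expected heterozygosity under random mating, H = 1 − Σpᵢ², which is an assumption introduced here; the definition given earlier is the observed proportion of heterozygous individuals, and the two agree only approximately in real populations. The allele frequency distributions are invented for illustration.

```python
def expected_heterozygosity(allele_freqs):
    """H = 1 - sum(p_i^2): the chance that two randomly drawn alleles differ."""
    if abs(sum(allele_freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical allele frequency distributions, invented for illustration only.
even_locus = [0.2, 0.2, 0.2, 0.2, 0.2]  # five equally common alleles
skewed_locus = [0.9, 0.05, 0.05]        # one dominant allele

print(round(expected_heterozygosity(even_locus), 3))
print(round(expected_heterozygosity(skewed_locus), 3))
```

A locus with several evenly distributed alleles clears the 0.7 threshold mentioned above, while a locus dominated by one allele falls far short of it even though it is technically polymorphic.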
The CODIS core panel, expanded from 13 to 20 loci in 2017, includes loci distributed across 16 different chromosomes. The European Standard Set of 12 loci, recommended by the European Network of Forensic Science Institutes (ENFSI), overlaps substantially with CODIS, and the CODIS expansion added loci from the European set to improve compatibility. India's proposed panel under the DNA Technology Regulation Bill follows a similar design philosophy. Harmonisation of locus panels across jurisdictions matters for transnational investigations, allowing DNA profiles generated in one country to be compared with databases in another.
For a deeper view of how profiling fits into forensic biotechnology more broadly, see the Forensic Biotechnology subject, and for how profiling interacts with serological identification of biological fluids, see Forensic Serology.
Sex chromosomes and mitochondrial DNA in forensic context
The sex chromosomes, X and Y, follow different inheritance rules from the autosomes, and those differences produce specialised forensic tools. The Y chromosome is inherited almost entirely unchanged from father to son, because it undergoes very little recombination. A set of Y-STR markers therefore produces a haplotype that is shared by all males in a paternal lineage. This makes Y-STR profiling useful for tracing a male lineage in historical cases or for identifying a male contributor in a complex mixture containing predominantly female DNA, as in some sexual assault cases where vaginal cells swamp the male fraction.
The limitation of Y-STR profiling follows directly from the inheritance rule: a Y haplotype cannot distinguish between brothers, a father and his sons, or any other males in the same paternal lineage. A match between crime-scene Y-STR evidence and a suspect's Y haplotype implicates the lineage, not the individual. Courts in the United States, United Kingdom, and European jurisdictions have grappled with how to present Y-STR evidence to juries; the standard approach is to report a haplotype frequency in the relevant population database rather than a match probability.
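Reporting a haplotype frequency amounts to counting how often the evidence haplotype appears in a reference database. The sketch below shows the simplest counting estimate. The three-marker haplotypes and the database are invented for illustration, real Y-STR panels use many more markers, and reported figures are usually accompanied by a confidence bound that this sketch omits.

```python
from collections import Counter


def haplotype_frequency(query, database):
    """Counting estimate: share of database haplotypes identical to the query."""
    if not database:
        raise ValueError("database must contain at least one haplotype")
    return Counter(database)[query] / len(database)


# Hypothetical three-marker haplotypes (repeat counts), invented for illustration only.
database = [
    (14, 13, 29), (14, 13, 29), (15, 12, 30),
    (14, 13, 30), (16, 13, 29), (14, 13, 29),
    (15, 13, 28), (14, 12, 29),
]
evidence = (14, 13, 29)

freq = haplotype_frequency(evidence, database)
print(f"Haplotype seen in {freq:.1%} of database entries")
```

Because the whole haplotype is inherited as a single unit, it is counted as one item; the per-locus multiplication used for autosomal STRs does not apply to linked Y-chromosome markers.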
Mitochondrial DNA (mtDNA) sits outside the nucleus, in the mitochondria, and is inherited maternally through the egg cytoplasm. Each cell contains hundreds to thousands of mitochondria, each with multiple copies of the circular 16,569-base-pair mtDNA genome. This copy number advantage makes mtDNA analysis valuable when nuclear DNA is absent or severely degraded: shed hairs without roots, ancient bones, and highly burnt remains may yield mtDNA when nuclear STR profiling fails. Like Y-STR, mtDNA sequence types are shared by matrilineal relatives and cannot distinguish between them. The forensic analysis of mtDNA focuses on the hypervariable regions (HV1 and HV2) of the control region, where sequence variation between individuals is greatest.
What percentage of the human genome consists of protein-coding gene exons?
Key Takeaways
- The human genome contains approximately 3.2 billion base pairs, organised into 23 pairs of chromosomes through a hierarchy of compaction from nucleosome beads to the fully condensed chromosome visible during cell division.
- Only about 1.5 percent of the genome consists of protein-coding exons. The rest includes introns, regulatory elements, non-coding RNA genes, repetitive sequences, and intergenic regions, many of which are functionally important.
- STR loci in non-coding regions are preferred for forensic profiling because they are highly polymorphic, reveal no sensitive personal information, produce short PCR amplicons that survive degradation, and are statistically independent across different chromosomes.
- New alleles at STR loci arise mainly through replication slippage, which inserts or deletes repeat units at a rate of roughly 1 in 500 to 1 in 1,000 transmissions, generating population diversity while maintaining parent-child allele sharing.
- Lineage markers provide specialised forensic tools: Y-STR profiling traces the paternal lineage and is useful in male-contributor analysis, while mitochondrial DNA, inherited maternally, is valuable when nuclear DNA is absent or degraded because its high copy number per cell improves recovery.
What is the difference between a chromosome and a gene?
How many base pairs are in the human genome?
Why do forensic scientists use STR loci rather than sequencing the whole genome?
What is genetic variation and why does it matter in forensics?
What are the sex chromosomes and how are they used in forensics?