The Genetic Code | Richard Crooks's Website

The discoveries that DNA is the genetic material containing the instructions for how to construct an organism, and that it is structured as a linear polymer of nucleotides, were important for the modern understanding of genetics. However, these discoveries alone didn't make it possible to fully utilize genetic science. It was also necessary to discover how the information in the DNA sequence is read by organisms. The code was elucidated in the decade following the discovery of the structure of DNA, culminating in the 1968 award of the Nobel Prize in Physiology or Medicine to Har Gobind Khorana, Robert W. Holley and Marshall Nirenberg.

The way that the DNA sequence is read by organisms is called the genetic code. While organisms differ in how they store and process DNA, the genetic code is essentially universal across all life forms, with only a few minor variants known (for example in mitochondria).

As per the central dogma of molecular biology described by Crick, DNA is transcribed to mRNA, which is then translated to protein. There are twenty standard amino acids found in proteins, but only four nucleotides. Because there are fewer nucleotides than amino acids, groups of nucleotides are needed to encode each amino acid. Groups of two nucleotides would only produce sixteen possible combinations (4^2 = 16), whereas groups of three nucleotides produce sixty-four combinations (4^3 = 64), easily sufficient to encode every amino acid. Each group of three nucleotides is called a codon, and each codon encodes a specific amino acid or a signal to stop protein translation (Figure 1).
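The counting argument and the codon lookup described above can be sketched in a few lines of Python. This is only an illustration: the 64-letter string is the standard genetic code written in U, C, A, G order, and the names `CODON_TABLE` and `translate`, as well as the short mRNA sequence, are made up for this example.

```python
from itertools import product

BASES = "UCAG"  # RNA nucleotides, in the order commonly used for codon tables

# Number of possible groups of 1, 2 and 3 nucleotides: 4, 16 and 64
for size in (1, 2, 3):
    print(size, len(BASES) ** size)

# Standard genetic code, one letter per codon in UCAG order ("*" marks a stop codon)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon: end of the protein
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical example sequence: AUG GCC UGG UAA
print(translate("AUGGCCUGGUAA"))  # MAW
```

The sketch assumes the sequence is already in the correct reading frame and contains only the letters U, C, A and G.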

[Image: a codon translation table]
Figure 1: A common representation of the genetic code as a codon table of RNA sequences. The amino acid translations are colour coded by the properties of the amino acids: non-polar (grey), polar neutral (green), acidic (red) and basic (blue). Amino acids generally cluster so that the first two nucleotides of all the codons for an amino acid are the same while the third varies, and amino acids of a particular type also cluster together. This makes the DNA sequence somewhat more resistant to mutations than it would be if the codons for an amino acid were spread out across the codon table.

Because there are more possible codons than there are amino acids, there is a certain amount of redundancy in the genetic code, with multiple codons encoding the same amino acid. As well as this redundancy, there is an imbalance in the number of codons that encode each amino acid. Leucine has six codons encoding it, while tryptophan has only one. This disparity broadly reflects the frequency of the different amino acids: those that occur more frequently tend to have more codons, while those that occur less frequently tend to have fewer (Figure 2).
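The imbalance in codon numbers can be checked by tallying the standard codon table. The snippet below is a minimal sketch: the 64-letter string is the standard genetic code in U, C, A, G order with "*" for stop codons, and the variable names are chosen just for this example.

```python
from collections import Counter

# Standard genetic code, one letter per codon in UCAG order ("*" marks a stop codon)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# How many codons encode each amino acid (stop codons are left out)
codons_per_amino_acid = Counter(aa for aa in AMINO_ACIDS if aa != "*")

# Group the one-letter amino acid codes by their number of codons
by_degeneracy = {}
for aa, n in sorted(codons_per_amino_acid.items()):
    by_degeneracy.setdefault(n, []).append(aa)

for n in sorted(by_degeneracy, reverse=True):
    print(n, "codons:", " ".join(by_degeneracy[n]))
```

In this tally leucine (L), arginine (R) and serine (S) have six codons each, while methionine (M) and tryptophan (W) have one each.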

[Image: a graph showing the correlation between the number of codons and amino acid frequency]
Figure 2: The percentage of each amino acid in the UniProtKB/TrEMBL release 2023_02 of 03-May-2023 (249,308,459 sequence entries, comprising 86,853,323,495 amino acids) compared with the number of codons that encode each amino acid. The pattern is that the more frequent an amino acid is, the more codons tend to encode it.

Codon redundancy means that mutations may not change the sequence of amino acids. Generally speaking, codons that encode the same amino acid share the same first two nucleotides, while the last nucleotide varies (Figure 1). Because of this, a mutation at the last nucleotide will often not change the amino acid. This provides some protection of the genetic sequence against mutation. Codon redundancy also gives rise to the phenomenon of codon usage, where organisms appear to have preferences for which codons they use to encode certain amino acids. These usage patterns vary between organisms (Figure 3) and may reflect regulation of how the protein product folds, or translational efficiency.
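The protective effect of the third codon position can be illustrated by trying every single-nucleotide substitution in every amino acid codon and counting how many leave the amino acid unchanged. This is a rough sketch using the standard codon table; it treats all substitutions as equally likely, which is a simplifying assumption, and the function name `synonymous_counts` is made up for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ("*" marks a stop codon)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_counts():
    """Count, per codon position, the substitutions that keep the amino acid."""
    counts = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # only start from codons that encode an amino acid
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa:
                    counts[pos] += 1
    return counts


counts = synonymous_counts()
total_per_position = 61 * 3  # 61 amino acid codons, 3 alternative bases each
for pos, n in enumerate(counts, start=1):
    print(f"position {pos}: {n}/{total_per_position} substitutions are silent")
```

Most of the silent substitutions fall at the third position, very few at the first, and none at the second.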

[Image: codon usage tables for humans and E. coli]
Figure 3: Differences in codon usage (based on codon frequency per 1,000 codons) between human (left) and Escherichia coli (right). The three stop codons are highlighted in yellow. In addition, the UCG, CCG, ACG, GCG, CGU and CGC codons are considered rare in humans, while the CUA, AUA, CCC, CGA, AGA and AGG codons are considered rare in Escherichia coli. Because codon usage varies between species, it is important to consider it in genetic engineering methods that move genes between organisms: a gene may not translate as efficiently in a different species, and may need to be edited. Whether a codon is considered rare does not depend on its frequency alone, but also on the frequency expected from the frequency of each nucleotide and of each amino acid. As a result, some codons are considered rare even though other codons for the same amino acid occur less frequently.
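The "frequency per 1,000 codons" measure used in codon usage tables can be computed from any coding sequence. The function below is a minimal sketch with a made-up name, `codon_usage_per_thousand`, and a very short hypothetical sequence; real usage tables are built from many genes of an organism, not a single short sequence.

```python
from collections import Counter


def codon_usage_per_thousand(cds):
    """Return the frequency of each codon per 1,000 codons in a coding sequence."""
    cds = cds.upper().replace("T", "U")  # accept DNA or RNA input
    # Split into codons, ignoring any incomplete codon at the end
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: 1000 * n / total for codon, n in counts.items()}


# Hypothetical coding sequence: ATG GCC GCC GCG TAA
usage = codon_usage_per_thousand("ATGGCCGCCGCGTAA")
for codon, freq in sorted(usage.items()):
    print(codon, freq)
```

In this toy sequence the alanine codon GCC is used twice and GCG once, the kind of unequal use between synonymous codons that usage tables summarize. The sketch assumes the sequence starts in the correct reading frame.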

The universality of the genetic code is what enables modern genetics. Scientists can analyze the sequences of genes in multiple different organisms to predict the function of those genes through homology. Genetic engineering using molecular biology techniques can also be used to move genes from one organism to another. This allows scientists to express human proteins (which would be hard to extract in large quantities) in bacteria, producing them for protein function research. Scientists can also produce transgenic organisms carrying genes from other organisms.