Standard IUB/IUPAC Amino and Nucleic Acid Codes

Borrowed from Tisdall's Beginning Perl for Bioinformatics, pp. 30-31:

For expediency, the names of the nucleic acids and the amino acids are often represented as one- or three-letter codes, as shown in Table 4-1 and Table 4-2. (This book mostly uses the one-letter codes for amino acids.)

Table 4-1. Standard IUB/IUPAC nucleic acid codes

Code    Nucleic Acid(s)
----    ----------------------
A       Adenine
C       Cytosine
G       Guanine
T       Thymine
U       Uracil
M       A or C (amino)
R       A or G (purine)
W       A or T (weak)
S       C or G (strong)
Y       C or T (pyrimidine)
K       G or T (keto)
V       A or C or G
H       A or C or T
D       A or G or T
B       C or G or T
N       A or G or C or T (any)

Table 4-2. Standard IUB/IUPAC amino acid codes

One-letter code    Amino acid                    Three-letter code
---------------    --------------------------    -----------------
A                  Alanine                       Ala
B                  Aspartic acid or Asparagine   Asx
C                  Cysteine                      Cys
D                  Aspartic acid                 Asp
E                  Glutamic acid                 Glu
F                  Phenylalanine                 Phe
G                  Glycine                       Gly
H                  Histidine                     His
I                  Isoleucine                    Ile
K                  Lysine                        Lys
L                  Leucine                       Leu
M                  Methionine                    Met
N                  Asparagine                    Asn
P                  Proline                       Pro
Q                  Glutamine                     Gln
R                  Arginine                      Arg
S                  Serine                        Ser
T                  Threonine                     Thr
V                  Valine                        Val
W                  Tryptophan                    Trp
X                  Unknown                       Xxx
Y                  Tyrosine                      Tyr
Z                  Glutamic acid or Glutamine    Glx
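
As an illustration of how the amino acid codes in Table 4-2 can be used in a program, here is a minimal Python sketch (not from the book, which uses Perl). The dictionary is transcribed from the table; the function names to_three_letter and to_one_letter and the "-" separator are hypothetical choices made for this example.

```python
# One-letter -> three-letter amino acid codes, transcribed from Table 4-2.
ONE_TO_THREE = {
    "A": "Ala", "B": "Asx", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "K": "Lys",
    "L": "Leu", "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln",
    "R": "Arg", "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp",
    "X": "Xxx", "Y": "Tyr", "Z": "Glx",
}

# Reverse lookup: three-letter -> one-letter.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(protein, sep="-"):
    """Expand a one-letter protein string (hypothetical helper).

    Lowercase input is accepted; a letter not in Table 4-2 raises KeyError.
    """
    return sep.join(ONE_TO_THREE[residue] for residue in protein.upper())


def to_one_letter(codes, sep="-"):
    """Collapse separator-joined three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in codes.split(sep))
```

For example, to_three_letter("MKV") gives "Met-Lys-Val", and to_one_letter("Met-Lys-Val") gives back "MKV".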

The nucleic acid codes in Table 4-1 include letters for the four basic nucleic acids; they also define single letters for all possible groups of two, three, or four nucleic acids. In most cases in this book, I use only A, C, G, T, U, and N. The letters A, C, G, and T represent the nucleic acids for DNA. U replaces T when DNA is transcribed into ribonucleic acid (RNA). N is the common representation for "unknown," as when a sequencer can't determine a base with certainty. Note that lowercase versions of these single-letter codes are also used on occasion, frequently for DNA, rarely for protein.
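
The nucleic acid codes lend themselves to a simple lookup table. The following Python sketch (again not from the book) maps each code in Table 4-1 to the bases it stands for, replaces T with U as described above, and turns a pattern containing ambiguity codes into a regular expression. The names IUPAC_BASES, transcribe, and iupac_to_regex are hypothetical, and transcribe is only the letter substitution described in the text, not a model of the biology.

```python
import re

# Each code from Table 4-1 mapped to the bases it can stand for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def transcribe(dna):
    """Replace T with U, keeping the case of each letter."""
    return dna.replace("T", "U").replace("t", "u")


def iupac_to_regex(pattern):
    """Compile an IUPAC pattern into a case-insensitive regular expression.

    Ambiguity codes become character classes, so "Y" becomes "[CT]".
    A letter not in Table 4-1 raises KeyError.
    """
    parts = []
    for code in pattern.upper():
        bases = IUPAC_BASES[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts), re.IGNORECASE)
```

For example, transcribe("ACGTacgt") gives "ACGUacgu", and iupac_to_regex("GAYN") compiles the pattern GA[CT][ACGT], which matches "gatc" but not "GAAC". Matching is case-insensitive because, as noted above, lowercase codes are common for DNA.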