Bioinformatics Tools, Lecture 8

Random Numbers

Seeding the generator:

In older versions of Perl (before v5.05), programs had to explicitly call srand to seed the generator. So to be backward compatible, you should include the invocation. In newer versions, the first call to rand will automatically invoke srand. With this implicit invocation (and if you call srand with no argument) the seed is a combination of the current time of day with some OS values (like process ID, etc.) You should be careful to call srand without an argument only once during a program, as multiple invocations will actually decrease the "randomness" of the generator.

Actually, the most common use for srand by programmers is to generate the same pseudo-random sequence on successive invocations of a program. This is usually done for debugging purposes. Just make sure you remove that debugging code before final release!

If you want more "random" sequences (such as with cryptography) you should use an even more unpredictable seed. A common technique is:

srand (time ^ $$ ^ unpack "%L*", `ps axw | gzip`);

There are a number of things to explain in this piece of code. The time function simply returns the time of day as a number of seconds from some starting point. The ^ have the same meaning as in C—the bitwise exclusive-or operator. The $$ returns the process ID. The "back-pops" (backward apostrophes) are a call to the OS. The code between the pops is passed to the OS and executed. The resulting output, as a string, is the return value of the back-pops. In this case ps is a Unix command to list a bunch of information about the currently running processes on the system. The output of that command is then pipelined to gzip, which compresses it into a smaller string. This compressed string is the return value of the back-pops.

The pack function takes a list of values and converts it into a string, using rules specified by a "template". Here, the template is %L*. The L signifies a 32 bit long integer. The * has its usual regular-expression meaning. The unpack method is the inverse. In a scalar context, such as here, it returns just the first value retrieved from the string. So here it treats the first 4 bytes of the string returned by the back-pops as a 32 bit integer and returns that value. The cool thing about this is that it would be virtually impossible to replicate this seed.

Now to actually get a random number:

rand (numeric_expression)

will return a pseudo-random floating point number in the range [0,numeric_expression). Use the int() casting operator to get integers. So, for example,

int(rand(6))

will return a integer from 0 to 5, inclusive.

Generating Sets of Random DNA

Later, when we start looking at BLAST (a very common tool in bioinformatics for searching vast databases of DNA sequences to look for certain patterns and similarities) we will sometimes need to make statistical analyses of some of the returned data. This is sometimes accomplished by comparing BLAST's results with those we might obtain with random sequences. The program linked here is from Tisdall's book (Example 7-4) and determines the percentage of locations that are identical between two random DNA strings of the same length. (Not surprisingly, because there are four bases, running this program will usually return a value very near 25%.)

Searching with Hashes

OK, suppose you're studying a cell from the human body. Humans have around 30,000 genes. We want to know which of these genes are being expressed in the cell under certain conditions. (A gene is expressed if it is generating proteins.) A biologist will use a microarray facility to generate the expression information for that cell. This information will also indicate the degree to which the genes are expressed. We want to allow the user to determine whether a particular gene is expressed. We need to scan the expression data to answer this query.

Suppose our data indicates that 8000 genes were expressed. For each query, we have to look through this set. We could use binary search to do this, but usually we'll do better using a hash. But before we go there, a bit of biology background.

DNA is comprised of four nucleotides. Proteins are comprised of 20 amino acids. Sets of three nucleotides can express an amino acid or a stop signal. We call such sets-of-three codons. (The whole process is more complicated than this: first the DNA transcribes RNA. The RNA is then translated into proteins. Still, the simplification works fine for computer scientists!) The machinery of the cell starts at some point along the RNA and reads the sequence, codon after codon, attaching the generated amino acids to the sequence of them that has already been formed until a stop signal is reached. The resulting string of amino acids is a protein. (A partial string of amino acids is a peptide.)

Figure 8-1 of Tisdall's book shows the amino acid that is generated by each of the 64 possible codons. (Can you see why there are 64?) (Of course, because there are 64 codons but only 20 amino acids, there's a lot of redundancy.) Each amino acid can be represented by a single letter (beats me how the scientists naming the acids knew to start each name with a different letter!) To begin with we'll need a way to encode this table, i.e., to map the three character sequence of each codon to the single letter representring the corresponding amino acid. Obviously we could do this with a simple array or with pattern matching. See the codon2aa function in Tisdall's bioinformatics module (BeginPerlBioinfo.pm) for an example of that:

# From Chapter 8

# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
#   Version 1

# This subroutine is commented out because a preferred version of it
# follows.

# sub codon2aa {
#     my($codon) = @_;
#     
#        if ( $codon =~ /TCA/i )    { return 'S' }    # Serine
#     elsif ( $codon =~ /TCC/i )    { return 'S' }    # Serine
#     elsif ( $codon =~ /TCG/i )    { return 'S' }    # Serine
#     elsif ( $codon =~ /TCT/i )    { return 'S' }    # Serine
#     elsif ( $codon =~ /TTC/i )    { return 'F' }    # Phenylalanine
#     elsif ( $codon =~ /TTT/i )    { return 'F' }    # Phenylalanine
#     elsif ( $codon =~ /TTA/i )    { return 'L' }    # Leucine
#     elsif ( $codon =~ /TTG/i )    { return 'L' }    # Leucine
#     elsif ( $codon =~ /TAC/i )    { return 'Y' }    # Tyrosine
#     elsif ( $codon =~ /TAT/i )    { return 'Y' }    # Tyrosine
#     elsif ( $codon =~ /TAA/i )    { return '_' }    # Stop
#     elsif ( $codon =~ /TAG/i )    { return '_' }    # Stop
#     elsif ( $codon =~ /TGC/i )    { return 'C' }    # Cysteine
#     elsif ( $codon =~ /TGT/i )    { return 'C' }    # Cysteine
#     elsif ( $codon =~ /TGA/i )    { return '_' }    # Stop
#     elsif ( $codon =~ /TGG/i )    { return 'W' }    # Tryptophan
#     elsif ( $codon =~ /CTA/i )    { return 'L' }    # Leucine
#     elsif ( $codon =~ /CTC/i )    { return 'L' }    # Leucine
#     elsif ( $codon =~ /CTG/i )    { return 'L' }    # Leucine
#     elsif ( $codon =~ /CTT/i )    { return 'L' }    # Leucine
#     elsif ( $codon =~ /CCA/i )    { return 'P' }    # Proline
#     elsif ( $codon =~ /CCC/i )    { return 'P' }    # Proline
#     elsif ( $codon =~ /CCG/i )    { return 'P' }    # Proline
#     elsif ( $codon =~ /CCT/i )    { return 'P' }    # Proline
#     elsif ( $codon =~ /CAC/i )    { return 'H' }    # Histidine
#     elsif ( $codon =~ /CAT/i )    { return 'H' }    # Histidine
#     elsif ( $codon =~ /CAA/i )    { return 'Q' }    # Glutamine
#     elsif ( $codon =~ /CAG/i )    { return 'Q' }    # Glutamine
#     elsif ( $codon =~ /CGA/i )    { return 'R' }    # Arginine
#     elsif ( $codon =~ /CGC/i )    { return 'R' }    # Arginine
#     elsif ( $codon =~ /CGG/i )    { return 'R' }    # Arginine
#     elsif ( $codon =~ /CGT/i )    { return 'R' }    # Arginine
#     elsif ( $codon =~ /ATA/i )    { return 'I' }    # Isoleucine
#     elsif ( $codon =~ /ATC/i )    { return 'I' }    # Isoleucine
#     elsif ( $codon =~ /ATT/i )    { return 'I' }    # Isoleucine
#     elsif ( $codon =~ /ATG/i )    { return 'M' }    # Methionine
#     elsif ( $codon =~ /ACA/i )    { return 'T' }    # Threonine
#     elsif ( $codon =~ /ACC/i )    { return 'T' }    # Threonine
#     elsif ( $codon =~ /ACG/i )    { return 'T' }    # Threonine
#     elsif ( $codon =~ /ACT/i )    { return 'T' }    # Threonine
#     elsif ( $codon =~ /AAC/i )    { return 'N' }    # Asparagine
#     elsif ( $codon =~ /AAT/i )    { return 'N' }    # Asparagine
#     elsif ( $codon =~ /AAA/i )    { return 'K' }    # Lysine
#     elsif ( $codon =~ /AAG/i )    { return 'K' }    # Lysine
#     elsif ( $codon =~ /AGC/i )    { return 'S' }    # Serine
#     elsif ( $codon =~ /AGT/i )    { return 'S' }    # Serine
#     elsif ( $codon =~ /AGA/i )    { return 'R' }    # Arginine
#     elsif ( $codon =~ /AGG/i )    { return 'R' }    # Arginine
#     elsif ( $codon =~ /GTA/i )    { return 'V' }    # Valine
#     elsif ( $codon =~ /GTC/i )    { return 'V' }    # Valine
#     elsif ( $codon =~ /GTG/i )    { return 'V' }    # Valine
#     elsif ( $codon =~ /GTT/i )    { return 'V' }    # Valine
#     elsif ( $codon =~ /GCA/i )    { return 'A' }    # Alanine
#     elsif ( $codon =~ /GCC/i )    { return 'A' }    # Alanine
#     elsif ( $codon =~ /GCG/i )    { return 'A' }    # Alanine
#     elsif ( $codon =~ /GCT/i )    { return 'A' }    # Alanine
#     elsif ( $codon =~ /GAC/i )    { return 'D' }    # Aspartic Acid
#     elsif ( $codon =~ /GAT/i )    { return 'D' }    # Aspartic Acid
#     elsif ( $codon =~ /GAA/i )    { return 'E' }    # Glutamic Acid
#     elsif ( $codon =~ /GAG/i )    { return 'E' }    # Glutamic Acid
#     elsif ( $codon =~ /GGA/i )    { return 'G' }    # Glycine
#     elsif ( $codon =~ /GGC/i )    { return 'G' }    # Glycine
#     elsif ( $codon =~ /GGG/i )    { return 'G' }    # Glycine
#     elsif ( $codon =~ /GGT/i )    { return 'G' }    # Glycine
#     else {
#         print STDERR "Bad codon \"$codon\"!!\n";
#         exit;
#     }
# }



# From Chapter 8

# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
#   Version 2

# This subroutine is commented out because a preferred version of it
# follows.

# sub codon2aa {
#     my($codon) = @_;
#  
#        if ( $codon =~ /GC./i)        { return 'A' }    # Alanine
#     elsif ( $codon =~ /TG[TC]/i)     { return 'C' }    # Cysteine
#     elsif ( $codon =~ /GA[TC]/i)     { return 'D' }    # Aspartic Acid
#     elsif ( $codon =~ /GA[AG]/i)     { return 'E' }    # Glutamic Acid
#     elsif ( $codon =~ /TT[TC]/i)     { return 'F' }    # Phenylalanine
#     elsif ( $codon =~ /GG./i)        { return 'G' }    # Glycine
#     elsif ( $codon =~ /CA[TC]/i)     { return 'H' }    # Histidine
#     elsif ( $codon =~ /AT[TCA]/i)    { return 'I' }    # Isoleucine
#     elsif ( $codon =~ /AA[AG]/i)     { return 'K' }    # Lysine
#     elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' }    # Leucine
#     elsif ( $codon =~ /ATG/i)        { return 'M' }    # Methionine
#     elsif ( $codon =~ /AA[TC]/i)     { return 'N' }    # Asparagine
#     elsif ( $codon =~ /CC./i)        { return 'P' }    # Proline
#     elsif ( $codon =~ /CA[AG]/i)     { return 'Q' }    # Glutamine
#     elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' }    # Arginine
#     elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' }    # Serine
#     elsif ( $codon =~ /AC./i)        { return 'T' }    # Threonine
#     elsif ( $codon =~ /GT./i)        { return 'V' }    # Valine
#     elsif ( $codon =~ /TGG/i)        { return 'W' }    # Tryptophan
#     elsif ( $codon =~ /TA[TC]/i)     { return 'Y' }    # Tyrosine
#     elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' }    # Stop
#     else {
#         print STDERR "Bad codon \"$codon\"!!\n";
#         exit;
#     }
# }

Although this code is easy enough to understand, the run-time of those methods would be linear in the number of patterns to look for. Instead, we'll use a hash.

Recall the basics of a hash. They are declared much like arrays except the identifier starts with a '%' instead of a '@'. We use the => operator to assign keys to values. So, the code

    my(%genetic_code) = (
    
    'TCA' => 'S',    # Serine
    'TCC' => 'S')    # Serine

creates the hash genetic_code that will map the key 'TCA' to the value 'S', and 'TCC' to 'S' also. To access the hash we use the {} operator, similar to the way the [] operator works with arrays. So

$genetic_code{'TCC'}

returns 'S'. We can check if a key is in the hash with the exists operator:

if (exists $genetic_code{$key})...

This code, then, demonstrates how we can encode the mapping using a hash. Because the function is based on the single use of a hash, its run-time will be constant.

Perl has a built-in mechanism, DBM, for storing hashes in files.

Reading DNA data (FASTA formatting)

A great amount of the information regarding known DNA sequences is stored in FASTA format in biological databases. FASTA files are also used to encode protein sequences. FASTA is a flat data format, meant to be human-readable. The files usually start with several lines of comments, each denoted by the first character being a ">". Each line is usually fewer than 80 characters, so DNA sequences are broken into chunks. When we read the file, we need to reassemble the chunks of each DNA strand into a single string. Here's some code that does that. The foreach loop in the code provides a nice example of Perl's capabilities for iterating over collections.

OK, now that we've read the DNA sequence we want to look for coding regions within the sequence, those portions of the DNA that actually expresses proteins. The problem is that we don't really know where the codons start. There are 6 possible reading frames. Why six? Given the sequence ...CTGA..., the codon boundary might start on the C, the T or the G. Moreover, when DNA sequences are listed, we only list one half of the sequence, omitting the reverse complement. So the proteins might be being generated by the reverse complement. Thus, for any DNA sequence, we get 6 reading frames. The problem is to identify which of the frames supports significant coding regions, regions that offer long chains of amino acids without a stop signal.

Here's code to print the peptides generated by each of the 6 reading frames of a given strand of DNA. You can try running Tisdall's example8-4.pl to see this in action.