BLAST

BLAST is one of the most important tools in the Bioinformaticist's tool chest. It is a tool for identifying matches of a given biological sequence to a database of other sequences. For example, a researcher may have identified a sequence of DNA and want to know if this sequence matches any known DNA sequence. BLAST is actually an acronym for a string matching algorithm, and there are many flavors and implementations of this algorithm. Some of these implementations are free--you could download them and run your own BLAST server locally. That would require a fair amount of memory to hold the large databases of sequences, and a fair amount of processing power to effect the searches. For now we'll be looking at the freely accessible BLAST server at the National Center for Biotechnology Information (NCBI) at http://www.ncbi.nlm.nih.gov/BLAST/.

As an example, go to this site, and access the nucleotide-nucleotide server (BLASTN). Try copying the first few lines of the DNA sequence in our sample FASTA file we looked at earlier, sample.dna, into the BLAST search engine and see what you get.

OK, so what is a BLAST "match"? Biologists are interested in similarity as an indication of homology. Similarity between queries and database sequences is measured by the percent identity, which is the number of bases in the query that exactly match those of the sequence. Another metric is the degree of conservation, which detects matches between codons or amino acid residues. In this case, the match can be in terms of what the codons express (recall that some peptides are expressed by several different codons.) I.e., the sequences can differ, so long as they express the same proteins.

BLAST will return a set of possible matches, and some statistics that identify the probability of the match being genuine. The statistics consist of three parts: a raw score, S, a measure of similarity and size of match. The expect value, E, which is the probability that the same match might occur in a randomly generated DB of the same size. The closer to 0.0, the more likely that the match is genuine. Values less than 1 can be considered a "solid hit", and values less than 10 are possibly worth looking into.

BLAST output

The file blast.txt is sample output generated by the BLASTN server above. In examining it, we can see that the output, though long, is separated into three parts: the beginning annotation (everything preceding "ALIGNMENTS"), the alignments (preceding "Database"), and the ending annotation. The program example12-1.pl parses the sample file. The first part of the parse_blast subroutine extracts these three parts into separate variables.

The trickiest part of the whole process is the parsing of the alignments, which is done in the parse_blast_alignment subroutine. There you will find this complicated regular expression:

while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {

Most of the regexp is straightforward, but there is one new feature: (?!>). This is a negative lookahead assertion. The intent is to ensure that the next line does not begin with a >. The next somewhat complicated piece of code is the line

my($key) = (split(/\|/, $value)) [1];

Can you see what that does? Use the debugger to break at this line and take a look at the value of $&. The third time through the while loop (thus, accessing the third alignment in the alignment section), we get this from the debugger:

  DB<5> p $key
AK017941.1
  DB<6> p $value
>dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN full-length
enriched
            library, clone:5830420C16, full insert sequence
          Length = 1461

 Score =  210 bits (106), Expect = 5e-52
 Identities = 151/166 (90%)
 Strand = Plus / Plus


Query: 235  ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294
            |||||||| |||||||  || ||||| ||||||||||| ||||||||||| |||||||||
Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107


Query: 295  gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354
            ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| ||
Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167


Query: 355  tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400
            |||||||||||||| |||||||||||||| ||||||||||||||||
Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213
 
 Score = 44.1 bits (22), Expect = 0.066
 Identities = 25/26 (96%)
 Strand = Plus / Plus


Query: 118 gcggaagtagttgtgggcgcctttgc 143
           ||||||||||||| ||||||||||||
Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260

 

This shows two high-scoring pairs (HSP's) for the database sequence labelled AK017941.1. Each HSP is composed of two parts, an annotation, and the matching pair itself. The latter shows the respective positions of the matching substrings in the query sequence and in the sequence from the server's database. So here, there is a match of a substring starting at position 235 in the query with a substring starting at position 1048 in the BLAST database sequence AK017941.1.

Program example12-2.pl provides a method, parse_blast_alignment_HSP for parsing the alignment section of BLAST output into the HSP's. Given an alignment section, it returns an array of strings, each of which is the HSPs for one database sequence. (Actually, element [0] is the preamble of the alignment that precedes the first HSP. In our example above, it is everything from the ">dbj" up to just before the first appearance of the word "Score".) We will use "Score" as the delimiter for the HSPs. Thus the matching operation:

    ($beginning_annotation, $HSP_section )
        = ($alignment =~ /(.*?)(^ Score =.*)/ms);

extracts the preamble just mentioned. Recall that the ? at the end of the regular expression (.*?) modifies the * to be "minimal". Thus, .*? will match the shortest substring that can be found that precedes " Score". The default behavior would be to find the longest such string. The next section of the program

    while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) {
        push(@HSPs, $&);

extracts each matching pair (the "Query"/"Sbjct" sets), along with its header information ("Score"....) The ?! at the beginning of embedded parentheses is the negative lookahead assertion again. So (^(?! Score =).*\n)+ will match any number of lines that do not start with "Score =". (The "lookahead" aspect of this operator means that when a line starting with "Score =" is finally found, it will not be "consumed", but remain for the next iteration of the global search.)

Use the debugger to see what parse_blast_alignment_HSP returns.

Okay, now that we've got the HSPs, the program parses one of them, as an example, via a call to extract_HSP_information. This subroutine has a number of interesting features. To begin with, the given argument to the subroutine (@_) is one match of an alignment, such as:

 Score =  210 bits (106), Expect = 5e-52
 Identities = 151/166 (90%)
 Strand = Plus / Plus


Query: 235  ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294
            |||||||| |||||||  || ||||| ||||||||||| ||||||||||| |||||||||
Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107


Query: 295  gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354
            ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| ||
Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167


Query: 355  tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400
            |||||||||||||| |||||||||||||| ||||||||||||||||
Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213

We want to form the entire matching string from the query sequence by concating its pieces. This statement does that:

    $query = join ( '' , ($HSP =~ /^Query(.*)\n/gm) );

Recall that in a list context, the =~ operator returns an array of all the matches. The 'g' modifier means that we'll get an array consisting of each match for (.*), i.e:

  DB<15> @bob = ($HSP =~ /Query(.*)\n/gm)

  DB<16> p "@bob"
: 235  ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294 : 295  g
agaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354 : 355  tgtgacaac
tgcaatgagtggttccatggggactgcatccggatca 400
  DB<17> p scalar @bob
3
  DB<18> p $bob[0]
: 235  ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294

[Ignore the linebreaks. They're just necessary to fit the text onto this page.] The join operator merely concatenates all the elements of the array into a single string. All that remains is to delete all the characters that are not DNA letters.

The subroutine then extracts the range of the match. This is a pair consisting of the indices into the sequence defining the match. The range for the match in the query string is 235..400. But those numbers are simply at the head an tail of $query, so we get them with

    $query_range = join('..', ($query =~ /(\d+).*\D(\d+)/s));

The debugger shows what's going on:

  DB<2> p $query
: 235  ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294: 295  ga
gaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354: 355  tgtgacaactg
caatgagtggttccatggggactgcatccggatca 400
  DB<3> $sue = ($query =~ /(\d+).*\D(\d+)/s)

  DB<4> p $sue
1
  DB<5> @sue = ($query =~ /(\d+).*\D(\d+)/s)

  DB<6> p "@sue"
235 400

The join simply inserts the '..' between the elements of the array in a concatenation.

Formatting Output

There are several mechanisms for formatting output in Perl. We've already briefly looked at one of them, the printf function, which is borrowed from the C programming langauge. You can view the documentation for printf via perldoc, though that will probably just redirect you to the sprintf function, which returns a formatted string rather than printing one.

If you want to generate multiline output, such as a form, many times, then you will want to use either the here construct, or format and write. Example 12-3, from Tisdall's book shows usage of a here document:

Example 12-3. Example of here document
#!/usr/bin/perl
# Example of here document

use strict;
use warnings;

my $DNA = 'AAACCCCCCGGGGGGGGTTTTTT';

for( my $i = 0 ; $i < 2 ; ++$i ) {
print <<HEREDOC;
     On iteration $i of the loop!
    $DNA

HEREDOC
}

exit;

Here's the output:

On iteration 0 of the loop!
AAACCCCCCGGGGGGGGTTTTTT

On iteration 1 of the loop!
AAACCCCCCGGGGGGGGTTTTTT

Example 12-4, from Tisdall's book shows use of the format & write technique, which is borrowed from Fortran:

Example 12-4. Example of format function to produce FASTA output
#!/usr/bin/perl
# Create fasta format  DNA output with "format" function

use strict;
use warnings;

# Declare variables
my $id = 'A0000';
my $description = 'Highly weird DNA.  This DNA is so unlikely!';
my $DNA = 'AAAAAACCCCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTTTTTTTTTTTTT';

# Define the format
format STDOUT =
# The header line
>@<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<...
$id,        $description
# The DNA lines
^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~
$DNA
.

# Print the fasta-formatted DNA output
write;

exit;

Here's the output:

>A0000      Highly unlikely DNA.  This DNA is so...
AAAAAACCCCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTT
TTTTTTTTTTTT

After declaring and initializing the variables that fill in the form, the form is defined with:

 format STDOUT =

and the format continues until it reaches the line with a period at the beginning.

The format is composed of three kinds of lines:

The picture line and the argument line must be adjacent; they can't be separated by a comment line, for instance.

BIOPERL

There are a great many modules defined for Bioinformatics processing. You should download and install Bioperl from www.bioperl.org to try these out. (You may also need to go to www.cpan.org to get some of the modules upon which Bioperl is dependent.) See the copious installation instructions at bioperl.org.