BLAST is one of the most important tools in the Bioinformaticist's tool chest. It is a tool for identifying matches of a given biological sequence to a database of other sequences. For example, a researcher may have identified a sequence of DNA and want to know if this sequence matches any known DNA sequence. BLAST is actually an acronym for a string matching algorithm, and there are many flavors and implementations of this algorithm. Some of these implementations are free--you could download them and run your own BLAST server locally. That would require a fair amount of memory to hold the large databases of sequences, and a fair amount of processing power to effect the searches. For now we'll be looking at the freely accessible BLAST server at the National Center for Biotechnology Information (NCBI) at http://www.ncbi.nlm.nih.gov/BLAST/.
As an example, go to this site, and access the nucleotide-nucleotide server (BLASTN). Try copying the first few lines of the DNA sequence in our sample FASTA file we looked at earlier, sample.dna, into the BLAST search engine and see what you get.
OK, so what is a BLAST "match"? Biologists are interested in similarity as an indication of homology. Similarity between queries and database sequences is measured by the percent identity, which is the number of bases in the query that exactly match those of the sequence. Another metric is the degree of conservation, which detects matches between codons or amino acid residues. In this case, the match can be in terms of what the codons express (recall that some peptides are expressed by several different codons.) I.e., the sequences can differ, so long as they express the same proteins.
BLAST will return a set of possible matches, and some statistics that identify the probability of the match being genuine. The statistics consist of three parts: a raw score, S, a measure of similarity and size of match. The expect value, E, which is the probability that the same match might occur in a randomly generated DB of the same size. The closer to 0.0, the more likely that the match is genuine. Values less than 1 can be considered a "solid hit", and values less than 10 are possibly worth looking into.
The file blast.txt is sample output generated by the BLASTN server above. In examining it, we can see that the output, though long, is separated into three parts: the beginning annotation (everything preceding "ALIGNMENTS"), the alignments (preceding "Database"), and the ending annotation. The program example12-1.pl parses the sample file. The first part of the parse_blast subroutine extracts these three parts into separate variables.
The trickiest part of the whole process is the parsing of the alignments, which is done in the parse_blast_alignment subroutine. There you will find this complicated regular expression:
while($alignment_section =~ /^>.*\n(^(?!>).*\n)+/gm) {
Most of the regexp is straightforward, but there is one new feature: (?!>)
.
This is a negative lookahead assertion. The intent is to ensure that
the next line does not begin with a >
. The next somewhat
complicated piece of code is the line
my($key) = (split(/\|/, $value)) [1];
Can you see what that does? Use the debugger to break at this line and take a look at the value of $&. The third time through the while loop (thus, accessing the third alignment in the alignment section), we get this from the debugger:
DB<5> p $key AK017941.1 DB<6> p $value >dbj|AK017941.1|AK017941 Mus musculus adult male thymus cDNA, RIKEN full-length enriched library, clone:5830420C16, full insert sequence Length = 1461 Score = 210 bits (106), Expect = 5e-52 Identities = 151/166 (90%) Strand = Plus / Plus Query: 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294 |||||||| ||||||| || ||||| ||||||||||| ||||||||||| ||||||||| Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107 Query: 295 gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354 ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| || Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167 Query: 355 tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400 |||||||||||||| |||||||||||||| |||||||||||||||| Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213 Score = 44.1 bits (22), Expect = 0.066 Identities = 25/26 (96%) Strand = Plus / Plus Query: 118 gcggaagtagttgtgggcgcctttgc 143 ||||||||||||| |||||||||||| Sbjct: 235 gcggaagtagttgcgggcgcctttgc 260
This shows two high-scoring pairs (HSP's) for the database sequence labelled AK017941.1. Each HSP is composed of two parts, an annotation, and the matching pair itself. The latter shows the respective positions of the matching substrings in the query sequence and in the sequence from the server's database. So here, there is a match of a substring starting at position 235 in the query with a substring starting at position 1048 in the BLAST database sequence AK017941.1.
Program example12-2.pl
provides a method, parse_blast_alignment_HSP
for parsing the alignment
section of BLAST output into the HSP's. Given an alignment section, it returns
an array of strings, each of which is the HSPs for one database sequence. (Actually,
element [0] is the preamble of the alignment that precedes the first HSP. In
our example above, it is everything from the ">dbj" up to just
before the first appearance of the word "Score".) We will use "Score"
as the delimiter for the HSPs. Thus the matching operation:
($beginning_annotation, $HSP_section ) = ($alignment =~ /(.*?)(^ Score =.*)/ms);
extracts the preamble just mentioned. Recall that the ? at the end of the regular expression (.*?) modifies the * to be "minimal". Thus, .*? will match the shortest substring that can be found that precedes " Score". The default behavior would be to find the longest such string. The next section of the program
while($HSP_section =~ /(^ Score =.*\n)(^(?! Score =).*\n)+/gm) { push(@HSPs, $&);
extracts each matching pair (the "Query"/"Sbjct" sets),
along with its header information ("Score"....) The ?! at the beginning
of embedded parentheses is the negative lookahead assertion again. So (^(?!
Score =).*\n)+
will match any number of lines that do not start
with "Score =". (The "lookahead" aspect of this operator
means that when a line starting with "Score =" is finally found, it
will not be "consumed", but remain for the next iteration of the global
search.)
Use the debugger to see what parse_blast_alignment_HSP
returns.
Okay, now that we've got the HSPs, the program parses one of them, as an example, via a call to extract_HSP_information. This subroutine has a number of interesting features. To begin with, the given argument to the subroutine (@_) is one match of an alignment, such as:
Score = 210 bits (106), Expect = 5e-52 Identities = 151/166 (90%) Strand = Plus / Plus Query: 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294 |||||||| ||||||| || ||||| ||||||||||| ||||||||||| ||||||||| Sbjct: 1048 ggagatggctcagacctggaacctccggatgccggggacgacagcaagtctgagaatggg 1107 Query: 295 gagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354 ||||| || ||||||||||||||||| ||||||||||||||||| ||||||||||| || Sbjct: 1108 gagaacgctcccatctactgcatctgtcgcaaaccggacatcaattgcttcatgattgga 1167 Query: 355 tgtgacaactgcaatgagtggttccatggggactgcatccggatca 400 |||||||||||||| |||||||||||||| |||||||||||||||| Sbjct: 1168 tgtgacaactgcaacgagtggttccatggagactgcatccggatca 1213
We want to form the entire matching string from the query sequence by concating its pieces. This statement does that:
$query = join ( '' , ($HSP =~ /^Query(.*)\n/gm) );
Recall that in a list context, the =~ operator returns an array of all the matches. The 'g' modifier means that we'll get an array consisting of each match for (.*), i.e:
DB<15> @bob = ($HSP =~ /Query(.*)\n/gm) DB<16> p "@bob" : 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294 : 295 g agaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354 : 355 tgtgacaac tgcaatgagtggttccatggggactgcatccggatca 400 DB<17> p scalar @bob 3 DB<18> p $bob[0] : 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294
[Ignore the linebreaks. They're just necessary to fit the text onto this page.] The join operator merely concatenates all the elements of the array into a single string. All that remains is to delete all the characters that are not DNA letters.
The subroutine then extracts the range of the match. This is a pair consisting of the indices into the sequence defining the match. The range for the match in the query string is 235..400. But those numbers are simply at the head an tail of $query, so we get them with
$query_range = join('..', ($query =~ /(\d+).*\D(\d+)/s));
The debugger shows what's going on:
DB<2> p $query : 235 ggagatggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggg 294: 295 ga gaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcggg 354: 355 tgtgacaactg caatgagtggttccatggggactgcatccggatca 400 DB<3> $sue = ($query =~ /(\d+).*\D(\d+)/s) DB<4> p $sue 1 DB<5> @sue = ($query =~ /(\d+).*\D(\d+)/s) DB<6> p "@sue" 235 400
The join
simply inserts the '..' between the elements of the array
in a concatenation.
There are several mechanisms for formatting output in Perl. We've already briefly
looked at one of them, the printf
function, which is borrowed from
the C programming langauge. You can view the documentation for printf
via perldoc, though that will probably just redirect you to the sprintf
function, which returns a formatted string rather than printing one.
If you want to generate multiline output, such as a form, many times, then
you will want to use either the here construct, or format
and write
. Example 12-3, from Tisdall's book shows usage of a here
document:
#!/usr/bin/perl # Example of here document use strict; use warnings; my $DNA = 'AAACCCCCCGGGGGGGGTTTTTT'; for( my $i = 0 ; $i < 2 ; ++$i ) { print <<HEREDOC; On iteration $i of the loop! $DNA HEREDOC } exit;
Here's the output:
On iteration 0 of the loop! AAACCCCCCGGGGGGGGTTTTTT On iteration 1 of the loop! AAACCCCCCGGGGGGGGTTTTTT
Example 12-4, from Tisdall's book shows use of the format & write technique, which is borrowed from Fortran:
#!/usr/bin/perl # Create fasta format DNA output with "format" function use strict; use warnings; # Declare variables my $id = 'A0000'; my $description = 'Highly weird DNA. This DNA is so unlikely!'; my $DNA = 'AAAAAACCCCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTTTTTTTTTTTTT'; # Define the format format STDOUT = # The header line >@<<<<<<<<< @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<... $id, $description # The DNA lines ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<~~ $DNA . # Print the fasta-formatted DNA output write; exit;
Here's the output:
>A0000 Highly unlikely DNA. This DNA is so... AAAAAACCCCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGTTTTTTTTT TTTTTTTTTTTT
After declaring and initializing the variables that fill in the form, the form is defined with:
format STDOUT =
and the format continues until it reaches the line with a period at the beginning.
The format is composed of three kinds of lines:
A comment beginning with the pound sign #
An argument line that names the variables that fill in the preceding picture line
The picture line and the argument line must be adjacent; they can't be separated by a comment line, for instance.
There are a great many modules defined for Bioinformatics processing. You should download and install Bioperl from www.bioperl.org to try these out. (You may also need to go to www.cpan.org to get some of the modules upon which Bioperl is dependent.) See the copious installation instructions at bioperl.org.