Code (example 9-1) for translating IUB ambiguity codes to regexps.
Code (example 9-2) for parsing a REBASE datafile.
Here's how example9-2 iterates through the contents of the REBASE file:
@rebasefile = get_file_data($rebasefile); foreach ( @rebasefile ) {
(Recall that get_file_data reads contents of the file, returning an array for which each element is a line from the file.) Alternatively, we could use this technique:
# Read in the REBASE file my $rebase_filehandle = open_file($rebasefile); while(<$rebase_filehandle>) {
Code (example 9-3) for looking for restriction sites (defined in a REBASE datafile) in a DNA strand (defined in a FASTA datafile.)
By the way, in this code we use the "$&" global variable, which
evaluates to the match of the most recent =~ operator. Likewise "$'"
evaluates to teh string following the match and "$`" is the string
that preceded the match. What are the values of these variables after '123456'
=~ /23/
?
A major international databank for known genetic sequences. We can see a sample in record.gb. For a complete specification of the syntax of these datafiles, see gbrel.txt. Each genbank file contains a set of records (our sample file consists of a single record). Each record consists of an annotation section followed by a sequence. There are two primary methods for parsing this information into those major sections. First, we can read all the record into an array (of lines), then concatenate where appropriate to form a single string for each section. This code (example 10-1) does that. There are several interesting syntax issues regarding example 10-1:
foreach my $line (@GenBankFile)
). In this case we
need just one boolean flag because there are only two sections within a record.Alternatively, we can read the entire record into a single string (necessarily containing newline chars), then use regular expressions to parse the record into its constituent sections. This code (example10-2) does that. Again, there are several interesting aspects to this code that we haven't seen before:
$save_input_separator
?($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
Why do we want the parentheses embedded in the target?"CCA\nGGT\nTTA" =~ /^.*$/m; print $&, "\n";
"CCA\nGGT\nTTA" =~ /^.*$/s; print $&, "\n";
Probably the main thing to notice about this code is its relative simplicity
compared to the first version. We've replaced an entire looping construct with
a single line containing a regular expression. This second program harnesses
the power of Perl's built-in regular expression matching machinery. There are
no explicit loops in the code because the iteration is now implicit within the
=~
operator. As you become more comfortable with regular expressions,
you will find yourself using them more and more frequently.
Next, we want to parse the annotations themselves. We can use the same two methods: this code (example 10-3) parses an annotation section using an array of lines with an iteration construct. This first version parses only fields (LOCUS, etc.) that consist of a single line. To parse the multi-line DEFINITION field, we again use the trick of a flag variable to indicate the parsing state, as show in example 10-4.This technique can become very complicated as the number of multi-line fields increases. Instead, we'll try to use regular expressions to do most of the parsing....
First, let's look at a program (example 10-5)
that separates the annotation and sequence into two scalar strings, then uses
regular expressions for a simple level of parsing. The mainline makes use of
the tell
opertor, which returns a byte-offset within the file (much
like the pos
operator for strings).Here we are considering parsing
a GenBank file containing more than one record, and we want to be able to identify
where each record begins. The companion operator to tell
is seek
,
which moves a file pointer to the given byte-offset. Check out the "WHAT'S
GOING ON HERE" lines within the file. Note that this program runs on a
larger GenBank file (library.gb)
that contains several records. There are a lot of subroutines here, many of
which are also in the BeginPerlBioinfo library. You should understand all of
them at this point.
Now we continue, parsing the annotation, placing each of its fields within a hash table. The label of each field serves as the key, while the associated value is the contents of that field. Here's the code (example 10-6) that does that (and only to the first record within the GenBank file.)
Last, we extend this program by parsing the FEATURES table. The list of possible sub-groups within FEATURES is in the textbook, pp.225-8. Because there may be several instances of some of these subgroups, we'll store their values in an array, instead of a hash table, which allows only one entry per key. This program (example 10-7) strips out each feature (and any associted sub-features) and stores them in the array.
Perl provides a simple database capability: hash tables can be stored in files,
and then restored/reloaded on subsequent invocations of the program. The Perl
tool is called DBM (data-base management). It consists primarily of just two
operators: dbmopen
and dbmclose
. (Actually, these
operators are somewhat obsolete, having been supplanted by the tie
operator, but explaining that presupposes we know how to use Perl classes, and
we haven't got that far yet. tie
allows us to bind arrays and other
objects to external databases, not just hash tables. If
you want to jump ahead though, go here.) The syntax for opening a dbm file
is:
dbmopen(%hashVariable, 'nameOfDBM_File', unixModeInOctal)
Where nameOfDBM_File is the name of the database, without any suffix.
If no such database exists, one will be created. DBM supports several different,
but similar, database formats (dbm, ndbm, sdbm, etc.) Depending on the format
used by your Perl implementation, the database will be stored as one or two
files, whose names will be nameOfDBM_File with a suffix. Cygwin's Perl
uses a.dir
and .pag
suffixes.
The unixModeInOctal field is an octal value (presented as a decimal integer) that specifies the access priveleges for the database files. These priveleges are specified via the same technique as in Unix: a four digit octal number:
ABCD
Where B specifies the priveleges of the owner of the file, C specifies those
of members of the "group" owning the file (Unix allows users to form
groups, and all files are owned by both a single user and a single group), and
D specifies the priveleges of everyone else. (The A digit is not germane to
these database files. See the Unix documentation for the chmod
command for a complete explanation.) Each digit can be thought of in terms of
the three bits needed to encode it:
XYZ
If X is 1, read permission is granted. Y grants write permissions. (Z grants executable permission, but that is not germane for database files, as they are not executables.) Thus a permission mode of 0666 gives everyone read and write access to the database, 0644 gives the owner read and write permission, but everyone else only read permissions and 0600 gives the owner read and write permissions, but no one else can access the database.
Here is a relatively straightforward program (example10-8.pl) that creates a hash tied to a DBM file and then reads the records from a GenBank file, storing in the database the mapping from the accession number of each GenBank record to the byte-offset to that record within the GenBank file. Later, the program uses the DBM database to quickly locate the records corresponding to user provided accession numbers within the GenBank file. In this example code, this is not much of a feat, as the GenBank file contains only a few records. Typical GenBank files, however, would contain hundreds or thousands of records, and the ability to access records via their byte-offsets will greatly speed data access. You should trying running the program within the debugger to see how the various values are handled.
A note of warning about DBM databases: remember that the hashes may be linked
to huge databases. If that is the case then applying the values or keys operator
to the hash can return huge lists, so huge they might exhaust the memory. I.e.,
@bob = keys %myDBMdb
might conceivably crash your program if %myDBMdb
is tied to a very large database. In that case, you would be better off iterating
through the hash table via the each
operator:
# print out history file offsets dbmopen(%HIST,'/usr/lib/news/history',0666); while (($key,$val) = each %HIST) { print $key; } dbmclose(%HIST);