Restriction Maps and Regular Expressions

Code (example 9-1) for translating IUB ambiguity codes to regexps.

Code (example 9-2) for parsing a REBASE datafile.

Here's how example9-2 iterates through the contents of the REBASE file:

    @rebasefile = get_file_data($rebasefile);

    foreach ( @rebasefile ) {

(Recall that get_file_data reads contents of the file, returning an array for which each element is a line from the file.) Alternatively, we could use this technique:

    # Read in the REBASE file
    my $rebase_filehandle = open_file($rebasefile);

    while(<$rebase_filehandle>) {

Code (example 9-3) for looking for restriction sites (defined in a REBASE datafile) in a DNA strand (defined in a FASTA datafile.)

By the way, in this code we use the "$&" global variable, which evaluates to the match of the most recent =~ operator. Likewise "$'" evaluates to teh string following the match and "$`" is the string that preceded the match. What are the values of these variables after '123456' =~ /23/ ?

GenBank:

A major international databank for known genetic sequences. We can see a sample in record.gb. For a complete specification of the syntax of these datafiles, see gbrel.txt. Each genbank file contains a set of records (our sample file consists of a single record). Each record consists of an annotation section followed by a sequence. There are two primary methods for parsing this information into those major sections. First, we can read all the record into an array (of lines), then concatenate where appropriate to form a single string for each section. This code (example 10-1) does that. There are several interesting syntax issues regarding example 10-1:

Alternatively, we can read the entire record into a single string (necessarily containing newline chars), then use regular expressions to parse the record into its constituent sections. This code (example10-2) does that. Again, there are several interesting aspects to this code that we haven't seen before:

Probably the main thing to notice about this code is its relative simplicity compared to the first version. We've replaced an entire looping construct with a single line containing a regular expression. This second program harnesses the power of Perl's built-in regular expression matching machinery. There are no explicit loops in the code because the iteration is now implicit within the =~ operator. As you become more comfortable with regular expressions, you will find yourself using them more and more frequently.

Parsing GenBank Annotations

Next, we want to parse the annotations themselves. We can use the same two methods: this code (example 10-3) parses an annotation section using an array of lines with an iteration construct. This first version parses only fields (LOCUS, etc.) that consist of a single line. To parse the multi-line DEFINITION field, we again use the trick of a flag variable to indicate the parsing state, as show in example 10-4.This technique can become very complicated as the number of multi-line fields increases. Instead, we'll try to use regular expressions to do most of the parsing....

First, let's look at a program (example 10-5) that separates the annotation and sequence into two scalar strings, then uses regular expressions for a simple level of parsing. The mainline makes use of the tell opertor, which returns a byte-offset within the file (much like the pos operator for strings).Here we are considering parsing a GenBank file containing more than one record, and we want to be able to identify where each record begins. The companion operator to tell is seek, which moves a file pointer to the given byte-offset. Check out the "WHAT'S GOING ON HERE" lines within the file. Note that this program runs on a larger GenBank file (library.gb) that contains several records. There are a lot of subroutines here, many of which are also in the BeginPerlBioinfo library. You should understand all of them at this point.

Now we continue, parsing the annotation, placing each of its fields within a hash table. The label of each field serves as the key, while the associated value is the contents of that field. Here's the code (example 10-6) that does that (and only to the first record within the GenBank file.)

Last, we extend this program by parsing the FEATURES table. The list of possible sub-groups within FEATURES is in the textbook, pp.225-8. Because there may be several instances of some of these subgroups, we'll store their values in an array, instead of a hash table, which allows only one entry per key. This program (example 10-7) strips out each feature (and any associted sub-features) and stores them in the array.

Indexing GenBank Records in a DBM Database

Perl provides a simple database capability: hash tables can be stored in files, and then restored/reloaded on subsequent invocations of the program. The Perl tool is called DBM (data-base management). It consists primarily of just two operators: dbmopen and dbmclose. (Actually, these operators are somewhat obsolete, having been supplanted by the tie operator, but explaining that presupposes we know how to use Perl classes, and we haven't got that far yet. tie allows us to bind arrays and other objects to external databases, not just hash tables. If you want to jump ahead though, go here.) The syntax for opening a dbm file is:

dbmopen(%hashVariable, 'nameOfDBM_File', unixModeInOctal)

Where nameOfDBM_File is the name of the database, without any suffix. If no such database exists, one will be created. DBM supports several different, but similar, database formats (dbm, ndbm, sdbm, etc.) Depending on the format used by your Perl implementation, the database will be stored as one or two files, whose names will be nameOfDBM_File with a suffix. Cygwin's Perl uses a.dir and .pag suffixes.

The unixModeInOctal field is an octal value (presented as a decimal integer) that specifies the access priveleges for the database files. These priveleges are specified via the same technique as in Unix: a four digit octal number:

ABCD

Where B specifies the priveleges of the owner of the file, C specifies those of members of the "group" owning the file (Unix allows users to form groups, and all files are owned by both a single user and a single group), and D specifies the priveleges of everyone else. (The A digit is not germane to these database files. See the Unix documentation for the chmod command for a complete explanation.) Each digit can be thought of in terms of the three bits needed to encode it:

XYZ

If X is 1, read permission is granted. Y grants write permissions. (Z grants executable permission, but that is not germane for database files, as they are not executables.) Thus a permission mode of 0666 gives everyone read and write access to the database, 0644 gives the owner read and write permission, but everyone else only read permissions and 0600 gives the owner read and write permissions, but no one else can access the database.

Here is a relatively straightforward program (example10-8.pl) that creates a hash tied to a DBM file and then reads the records from a GenBank file, storing in the database the mapping from the accession number of each GenBank record to the byte-offset to that record within the GenBank file. Later, the program uses the DBM database to quickly locate the records corresponding to user provided accession numbers within the GenBank file. In this example code, this is not much of a feat, as the GenBank file contains only a few records. Typical GenBank files, however, would contain hundreds or thousands of records, and the ability to access records via their byte-offsets will greatly speed data access. You should trying running the program within the debugger to see how the various values are handled.

A note of warning about DBM databases: remember that the hashes may be linked to huge databases. If that is the case then applying the values or keys operator to the hash can return huge lists, so huge they might exhaust the memory. I.e., @bob = keys %myDBMdb might conceivably crash your program if %myDBMdb is tied to a very large database. In that case, you would be better off iterating through the hash table via the each operator:

 # print out history file offsets
 dbmopen(%HIST,'/usr/lib/news/history',0666);
 while (($key,$val) = each %HIST) {
     print $key;
 }
 dbmclose(%HIST);