Implementing a Significant Class

Chapter 5 of the textbook describes, in detail, the process of creating a class for handling restriction enzymes and restriction maps. Again, I'll just try to hit the high points. Please read the text in combination with these lecture notes.

We looked at restriction enzymes in an earlier lecture; they are enzymes that can be used to split DNA sequences at particular locations. A restriction map indicates where a given DNA sequence might be split by a particular restriction enzyme. The recognition site of each enzyme can effectively be treated as a regular expression that finds the possible splitting points.

We will work with two classes. The Rebase class will store the information from the Rebase database, a flat file basically consisting of two columns: the first lists the names of restriction enzymes, and the second lists the restriction sites (which can easily be interpreted as regular expressions). These notes use an older version (v212) of this database; newer ones can be found online at ftp://ftp.neb.com/pub/rebase. Just look for a file named bionet.xxx (as of this writing, v403).

The Rebase object stores the database as a hash table (the _rebase field), with restriction enzyme names as the keys and the sites as the values. It is possible for multiple names to refer to the same site. In addition, the object can be tied to a DBM file, which will speed future reinstantiations of the object after it is initially created by reading the database file, bionet.xxx. (We looked at DBM databases in an earlier lecture.)
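
To make the persistence idea concrete, here is a minimal sketch of a hash that survives between runs. Note the assumptions: it uses SDBM_File (a DBM module bundled with Perl) instead of the DB_File class that Rebase.pm uses, and the file name BIONET_DEMO and the temporary directory are made up for the demonstration.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical location for the demo DBM file (removed when the program exits).
my $dir     = tempdir(CLEANUP => 1);
my $dbmfile = File::Spec->catfile($dir, 'BIONET_DEMO');

# "First run": create the DBM file and store one enzyme in it.
my %rebase;
tie(%rebase, 'SDBM_File', $dbmfile, O_RDWR | O_CREAT, 0644)
    or die "Cannot open DBM file $dbmfile: $!";
$rebase{EcoRI} = 'GAATTC';
untie %rebase;

# "Later run": attach a fresh hash to the same file; no parsing is needed.
my %again;
tie(%again, 'SDBM_File', $dbmfile, O_RDWR, 0644)
    or die "Cannot reopen DBM file $dbmfile: $!";
my $site = $again{EcoRI};
print "EcoRI => $site\n";
untie %again;
```

The second tie finds the entry written by the first, which is the effect the Rebase class relies on to avoid re-reading bionet.xxx every time.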

Now, examine the new() method in Rebase.pm. Notice the statement:

    unless($arg{dbmfile}) 

Do you remember why dbmfile isn't quoted?
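
As a reminder, here is a small sketch of the quoting rule at work. The hash %arg below is a stand-in for the argument hash in new(), and the value 'BIONET' is only an example.

```perl
use strict;
use warnings;

# The => operator quotes the simple identifier on its left.
my %arg = (dbmfile => 'BIONET');

# Inside the braces of a hash subscript, a simple identifier is treated
# as a string, so both of these subscripts name the same key.
my $bare   = $arg{dbmfile};
my $quoted = $arg{'dbmfile'};

print "same value\n" if $bare eq $quoted;
```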

Now look to where the DBM file is actually accessed via the code:

    unless(tie %{$self->{_rebase}}, 'DB_File', $arg{dbmfile}, O_RDWR|O_CREAT, $self->{_mode}, $DB_HASH) {
        my $permissions = sprintf "%lo", $self->{_mode};
        croak "Cannot open DBM file $arg{dbmfile} with mode $permissions";
    }

First, notice that we are accessing the DBM file via the tie function rather than dbmopen. I won't go into all the syntax of tie (see perldoc -f tie if you are interested), but the basic idea is that we are associating a particular variable (the hash table in our Rebase object, in this case) with a particular class (here DB_File, a predefined class in Perl; if you are interested in the gory details, see perldoc DB_File). The remainder of the arguments to the tie function are passed to the constructor of that class, which for a tied hash is the method named TIEHASH rather than new(). These arguments specify, in order, the name of the DBM file (the value of $arg{dbmfile}), how to go about opening the file (O_RDWR|O_CREAT), the permissions to be associated with the file ($self->{_mode}), and how the database is to be encoded (as a hash, a tree, an array, etc.; here the value is $DB_HASH).
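
If the mechanics of tie seem mysterious, the following sketch may help. It ties a hash to a tiny made-up class, UpperHash (purely hypothetical, and far simpler than DB_File), to show that tie hands its extra arguments to the class's TIEHASH method and that ordinary hash reads and writes are then routed through the class.

```perl
use strict;
use warnings;

package UpperHash;    # hypothetical class, for illustration only

# tie() calls TIEHASH with the class name plus any extra arguments.
sub TIEHASH {
    my ($class, @args) = @_;
    return bless { args => [@args], data => {} }, $class;
}

# Called for every assignment to an element of the tied hash.
sub STORE {
    my ($self, $key, $value) = @_;
    $self->{data}{$key} = uc $value;
}

# Called for every read of an element of the tied hash.
sub FETCH {
    my ($self, $key) = @_;
    return $self->{data}{$key};
}

package main;

my %sites;
my $object = tie %sites, 'UpperHash', 'extra', 'arguments';

$sites{EcoRI} = 'gaattc';          # routed through UpperHash::STORE
print "$sites{EcoRI}\n";           # routed through UpperHash::FETCH
print "@{ $object->{args} }\n";    # the arguments that TIEHASH received
```

DB_File does the same kind of interception, except that its STORE and FETCH read and write the DBM file on disk.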

The expression defining how the file is to be accessed, (O_RDWR|O_CREAT), deserves a bit of explanation. The argument is interpreted as a bit string. O_RDWR and O_CREAT are integer constants; on typical systems the binary representation of each has a single 1 bit, and the two bits are in different positions. Thus the bitwise OR operation, |, forms a single integer whose binary representation has both bits set. In this case the expression indicates that the file should be opened for reading and writing, and that if the file does not already exist, it should be created.
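
You can inspect the flags yourself with a short sketch like the one below. The constants come from the standard Fcntl module; their actual numeric values depend on your operating system, so I show no expected output.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);

# Combine the two flags with bitwise OR.
my $flags = O_RDWR | O_CREAT;

# Print each value in binary; the exact bit positions are system dependent.
printf "O_RDWR          = %b\n", O_RDWR;
printf "O_CREAT         = %b\n", O_CREAT;
printf "O_RDWR|O_CREAT  = %b\n", $flags;

# Bitwise AND recovers an individual flag from the combination.
print "create flag is set\n"     if ($flags & O_CREAT) == O_CREAT;
print "read/write flag is set\n" if ($flags & O_RDWR)  == O_RDWR;
```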

The _mode argument specifies the access permissions for the file in a syntax borrowed from Unix. Permissions are written as an octal number, which in Perl means a literal with a leading zero. (Note that the first character is the digit "zero", not the letter "oh".) If you want the file to be readable and writable by everyone, use 0666. If you want the file to be readable and writable only by yourself, use 0600. If you'd like members of your local "group" to be able to read and write the file as well, use 0660. (If you'd like full documentation on these modes, see the documentation on the Unix command chmod.) Try running testRebase in debug mode. Step until you get into Rebase.pm, then set a breakpoint at line 75, the first line past the tie command shown above. Examine the value of the _mode field; it prints as decimal 420, which is octal 644 (readable and writable by the owner, readable by everyone else):

Rebase::new(Rebase.pm:75):          if($arg{bionetfile}) {
  DB<3> p $self->{_mode}
420
  DB<4> p sprintf "%lo", $self->{_mode}
644
  DB<5>
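
The debugger session above prints the same number two ways. A short sketch of the decimal and octal views, using the value 0644 seen in the session:

```perl
use strict;
use warnings;

# A leading zero makes this an octal literal.
my $mode = 0644;

# Interpolated into a string, the number appears in decimal.
my $decimal = "$mode";

# The %o format shows it in octal again.
my $octal = sprintf "%o", $mode;

# oct() converts a string of octal digits into a number.
my $from_string = oct('0644');

print "decimal: $decimal\n";
print "octal:   $octal\n";
print "same number\n" if $from_string == $mode;
```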

Because the Rebase class has so few fields (4) we don't really need the full power of the AUTOLOAD facility to create mutators and accessors; we'll just code them manually. There are two methods that act a lot like accessors, get_regular_expressions and get_recognition_sites, that are used to access the regular expressions and recognition sites that are bound to a particular restriction enzyme by name. An example:

  DB<3> p %$rebase
_rebaseHASH(0x102c63c4)_bionetfilebionet.212_mode420_dbmfileBIONET
  DB<4> p $rebase->get_recognition_sites('AarI')
NNNNNNNNGCAGGTGCACCTGCNNNN
  DB<5> p $rebase->get_regular_expressions('AarI')
[ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGTGCACCTGC[ACGT][ACGT][ACGT][ACGT]
  DB<6>
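
To see how a recognition site can turn into the regular expression shown above, here is a rough sketch. The function site_to_regex is my own hypothetical helper, not the textbook's code; it expands the standard IUB ambiguity codes into character classes and, as a simplifying assumption, throws away anything that is not a letter (such as cut markers).

```perl
use strict;
use warnings;

# IUB ambiguity codes and the character classes they stand for.
my %iub = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[GA]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[GC]',  W => '[AT]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Hypothetical helper: translate a recognition site into a regular expression.
sub site_to_regex {
    my ($site) = @_;
    $site = uc $site;
    $site =~ s/[^A-Z]//g;    # assumption: ignore cut markers and punctuation
    my $regex = '';
    for my $code (split //, $site) {
        # Letters that are not IUB codes are copied through unchanged.
        $regex .= exists $iub{$code} ? $iub{$code} : $code;
    }
    return $regex;
}

print site_to_regex('CGGCCG'), "\n";    # no ambiguity codes, so unchanged
print site_to_regex('NNNNNNNNGCAGGTGCACCTGCNNNN'), "\n";
```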

Most of the heavy lifting of the Rebase class is done by the parse_rebase method. It skips over the header lines of the file to get to the lines that define the enzymes, one enzyme per line. The method iterates over the fields of each line, creating an entry in the hash table that maps the name(s) of the enzyme to its restriction site and regular expression. Because the hash is tied to a DBM file, its values must be scalars. Consequently the only way we can easily have an enzyme name map to both its site and regular expression is to store the latter pair as a simple string, with the two fields separated by whitespace. For example (continuing our debugging session) the entry corresponding to the enzyme named "AaaI" is a string consisting of two fields. The first is the restriction site, and the second is the corresponding regular expression. (In this case they are identical.)

141:        while() {
  DB<9> p $self->{_rebase}{'AaaI'}
CGGCCG CGGCCG
  DB<10>
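
Packing two fields into one string, and getting them back out, takes only a join and a split. The sketch below uses an ordinary hash named %rebase as a stand-in for the tied _rebase field; the two entries are the ones that appear in these notes.

```perl
use strict;
use warnings;

# Stand-in for the tied hash: each value is "site regex" in one string.
my %rebase = (
    AaaI  => 'CGGCCG CGGCCG',
    EcoRI => join(' ', 'GAATTC', 'GAATTC'),    # packing the pair
);

# Unpacking: split ' ' breaks the string on whitespace.
my ($site, $regex) = split ' ', $rebase{AaaI};

print "site:  $site\n";
print "regex: $regex\n";
```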

Identifying Recognition Sites

Now that we've codified the restriction site information of the enzymes we can go about determining whether--and where--a set of enzymes will match against a proffered sequence. The Restriction class contains a reference to a Rebase object (our "database" of enzymes), a DNA sequence, a list of the names of enzymes for which to try to find the restriction sites in the sequence, and a map. This map is the core of the Restriction class. It is created by the Restriction object and indicates the locations (specified as character offsets) at which each enzyme's restriction site can be found in the sequence.

The map is generated by the map_enzyme method, which is invoked by the constructor. Most of the code there is self-explanatory except possibly for the use of the $-[0] special variable in the method match_positions, which returns an array consisting of all the positions at which a given regular expression matched the sequence. The special array @- holds the starting offsets of the most recent successful match, and its first element, $-[0], is the offset from the beginning of the target string at which the entire match began.

The file testRestriction uses a Restriction object to locate all the occurrences of the enzyme 'EcoRI' in a very small sequence ("ACGAATTCCGGAATTCG"). Here's the program running:

$ perl testRestriction
EcoRI data in Rebase is GAATTC GAATTC
Sequence is ACGAATTCCGGAATTCG
Locations for EcoRI are 3 11
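
Here is a minimal sketch of how $-[0] can be used to collect such locations. The function find_site_positions is a hypothetical stand-in, not the textbook's match_positions; I add 1 to each offset on the assumption that positions are reported counting from 1, which is consistent with the output above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based start of every match of
# $regexp in $sequence (overlapping matches are not reported).
sub find_site_positions {
    my ($regexp, $sequence) = @_;
    my @positions;
    while ($sequence =~ /$regexp/ig) {
        # $-[0] is the 0-based offset at which this match started.
        push @positions, $-[0] + 1;
    }
    return @positions;
}

my @locations = find_site_positions('GAATTC', 'ACGAATTCCGGAATTCG');
print "Locations for EcoRI are @locations\n";
```

The /g modifier makes each pass through the loop resume the search where the previous match ended, so the loop visits the matches from left to right.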

Okay, so it's not very exciting. Worse, for longer sequences there may be a great many matching locations; a long string of numbers output will not be very helpful. So the next step is to convert this to a more graphical output. To that end, we create a Restrictionmap class, a sub-class of Restriction, that contains a text-based graphical representation of the mapping locations. This subclass adds two fields: graphictype and graphic. The latter will hold the text-based graphics as a string. The former isn't really used much yet, but indicates the type of graphic. In our case, it will be "text".

The method _drawmap_text() does most of the work of generating the graphic. The graphic is an array of strings that should be displayed one after another. The result should be the original DNA sequence displayed across several lines, separated by blank lines. If a restriction site is found, the enzyme's name is displayed immediately above the portion of the DNA sequence at which it matches. Generating the array of strings is difficult because several enzymes may map to roughly the same location, in which case the enzyme names must be drawn one above the other. The file testRestrictionmap demonstrates the utility of the Restrictionmap class. Here is a portion of the output it generates:


                                             EcoRI
CTCTGCTTCGCCCCACAAATCCTCTCCGCAGCCCTTGGTGGCCACGAATT



CACCCAGCCAGCATCACCAGCAGCAGCAGCAGCAGATCAAACGGTCAGCC

HindIII
      EcoRI
AAGCTTGAATTCCGCATGTGTGGTGAGTGTGAGGCACCAGTGACGCCCTC



AGAGTCCCTGCCAAGGCCCCGCCGGCCACTGCCCACCCAACAGCAGCCAC


                   EcoRI
AGCCATCACAGAAGTTAGGGAATTCGCGCATCCGTGAAGATGAGGGGGCA

The output primarily consists of the given sequence (in this case a nucleotide sequence). At a position where one of the given restriction enzymes (such as EcoRI or HindIII) would map, its name is placed above the sequence. You can see that it is possible for two such locations to overlap, in which case the enzyme names are stacked.
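
The stacking logic can be sketched in a few dozen lines. This is my own simplified illustration, not the textbook's _drawmap_text: the function draw_text_map and its arguments are hypothetical, positions are assumed to count from 1, and a name is moved up a row whenever the slot it needs (plus one separating space) is already occupied.

```perl
use strict;
use warnings;

# Hypothetical helper (not the textbook's _drawmap_text).
# $sites maps an enzyme name to a list of 1-based match positions.
sub draw_text_map {
    my ($sequence, $sites, $width) = @_;
    my @output;

    for (my $start = 0; $start < length($sequence); $start += $width) {
        my $chunk = substr($sequence, $start, $width);

        # Collect [column, name] pairs for the sites that begin in this chunk.
        my @labels;
        for my $name (sort keys %$sites) {
            for my $pos (@{ $sites->{$name} }) {
                my $col = $pos - 1 - $start;
                push @labels, [$col, $name] if $col >= 0 && $col < length($chunk);
            }
        }

        # Place the rightmost labels first; $rows[0] is the row nearest the sequence.
        my @rows;
        for my $label (sort { $b->[0] <=> $a->[0] } @labels) {
            my ($col, $name) = @$label;
            my $row = 0;
            while (1) {
                $rows[$row] = ' ' x $width unless defined $rows[$row];
                # The slot (plus one separating space) must still be blank.
                last if substr($rows[$row], $col, length($name) + 1) !~ /\S/;
                $row++;    # occupied, so stack the name one row higher
            }
            substr($rows[$row], $col, length($name), $name);
        }

        # Emit the highest row first, then the sequence line and a blank line.
        for my $line (reverse @rows) {
            (my $trimmed = $line) =~ s/\s+$//;
            push @output, $trimmed;
        }
        push @output, $chunk, '';
    }
    return @output;
}

my @map = draw_text_map('AAGCTTGAATTCCGCATGTGTGGTGAGTG',
                        { HindIII => [1], EcoRI => [7] }, 50);
print "$_\n" for @map;
```

Run on the start of the third sequence line shown above, the sketch puts HindIII on the upper row and EcoRI directly above the sequence, as in the sample output.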