First, we define a class for reading and writing files, FileIO.pm. The class is essentially nothing more than the name of the file (filename) and an array (filedata) holding the contents of the file. (In addition, there are date and writemode fields.) In other words, a FileIO object is a lot like a cached file.
The important methods in this class are read and write, which move data between filedata and the associated file. The read method appears to be overly general here: it apparently allows the FileIO's attributes to be set via the argument list, yet %_attribute_properties marks every attribute as "noinit", meaning none of them can be set in this manner. The reason for this approach is to enable subclasses to use read, as we'll see below.
The statement that does most of the work is:
$self->{'_filedata'} = [ <FileIOFH> ];
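To see what that statement does in isolation, here is a minimal sketch. It assumes a plain hash reference in place of a real FileIO object and uses a scratch file created on the spot; the lexical filehandle $fh is a substitution for the bareword FileIOFH used in the class.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file so the example is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "first line\n", "second line\n", "third line\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Hypothetical stand-in for a FileIO object: just a hash reference.
my $self = { _filename => $tmp_name };

open(my $fh, '<', $self->{_filename})
    or die "Cannot open $self->{_filename}: $!";

# In list context <$fh> returns every remaining line, so the anonymous
# array holds one element per line, newlines included.
$self->{_filedata} = [ <$fh> ];
close $fh;

my $line_count = scalar @{ $self->{_filedata} };
print "Read $line_count lines\n";
print "Line 2: $self->{_filedata}[1]";
```

Because the lines keep their trailing newlines, writing the array back out with print reproduces the file unchanged.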
The write method is similar. First it allows the caller to set the attributes of the FileIO object via the argument list, then it writes the FileIO object to its file using the write mode (such as > to overwrite the current file contents, or >> to append to the file). The statement that does most of the work is:
unless( open( FileIOFH, $self->get_writemode . $self->get_filename ) )
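The effect of the two write modes can be sketched as follows. The helper write_lines is hypothetical (it is not part of FileIO), and it uses the three-argument form of open, which keeps the mode separate from the filename, rather than the concatenated two-argument form shown above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper (not part of FileIO): write @$lines to $filename
# using $mode, which is '>' to overwrite or '>>' to append.
sub write_lines {
    my ($filename, $mode, $lines) = @_;
    die "Unsupported write mode '$mode'" unless $mode eq '>' || $mode eq '>>';

    # Three-argument open: mode and filename are passed separately.
    open(my $fh, $mode, $filename) or die "Cannot open $filename: $!";
    print {$fh} @$lines;
    close $fh or die "Cannot close $filename: $!";
    return;
}

my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'out.txt');

write_lines($file, '>',  ["one\n", "two\n"]);   # creates the file
write_lines($file, '>',  ["replaced\n"]);       # discards earlier contents
write_lines($file, '>>', ["appended\n"]);       # adds to the end

open(my $in, '<', $file) or die "Cannot open $file: $!";
my @contents = <$in>;
close $in;
print @contents;
```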
We can test this class with testFileIO. In this file, note the use of the write method: we change the value of the filename field here via the parameter list to write so that the FileIO object will be written to a different file than it was initialized from.
In this section we create a "subclass" of FileIO that is specialized
to handle files that contain biological sequences. That class, SeqFileIO, is
defined via the file SeqFileIO.pm.
The most important thing to note in this class is the use of the base
pragma (use base). If Perl cannot find a method of a given name in a class,
this mechanism causes Perl to look for the definition of that method in the
superclasses. For example, suppose we have a SeqFileIO object, $obj.
Then $obj->get_count will invoke the get_count method defined in FileIO.
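A minimal sketch of this lookup, using two hypothetical packages (DemoFileIO and DemoSeqFileIO) defined in a single file in place of the real classes:

```perl
use strict;
use warnings;

# Hypothetical parent class standing in for FileIO.
package DemoFileIO;

my $count = 0;    # number of objects created so far

sub new {
    my ($class, %arg) = @_;
    $count++;
    return bless { _filename => $arg{filename} }, $class;
}

sub get_count { return $count }

# Hypothetical subclass standing in for SeqFileIO.
package DemoSeqFileIO;

# 'use base' puts DemoFileIO on this package's @ISA list. Because
# DemoFileIO is already defined above, no DemoFileIO.pm file is needed.
use base 'DemoFileIO';

sub get_format { return 'unknown' }

package main;

# new is not defined in DemoSeqFileIO, so Perl finds it in DemoFileIO.
my $obj = DemoSeqFileIO->new(filename => 'record.gb');

print ref($obj), "\n";          # the object is still a DemoSeqFileIO
print $obj->get_count, "\n";    # inherited from DemoFileIO
print $obj->get_format, "\n";   # defined in DemoSeqFileIO itself
```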
There are some seeming oddities. Note that SeqFileIO redefines _all_attributes,
_permissions and _attribute_default, even though these definitions are identical
to the ones in FileIO. This is because those methods are defined within a closure.
We don't want SeqFileIO to use the inherited versions, because those would refer
to the value of %_attribute_properties in FileIO rather than in SeqFileIO. Now,
when the (inherited) AUTOLOAD method is invoked, it calls these closure-related
methods, and the appropriate versions are used (depending on whether the invocant
is a FileIO or a SeqFileIO).
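The following sketch illustrates why the redefinition is needed. The package names and attribute lists are hypothetical, and an ordinary inherited method (describe) plays the role that AUTOLOAD plays in the real classes.

```perl
use strict;
use warnings;

package DemoParent;
{
    # Visible only to the subs defined inside this block.
    my %_attribute_properties = (
        _filename => 1,
        _filedata => 1,
    );
    sub _all_attributes { return sort keys %_attribute_properties }
}

sub new { my ($class) = @_; return bless {}, $class }

# Inherited by every subclass. The method call is resolved against the
# class of $self, so an overriding _all_attributes is used if present.
sub describe {
    my ($self) = @_;
    return join ',', $self->_all_attributes;
}

package DemoChild;
use base 'DemoParent';
{
    my %_attribute_properties = (
        _filename => 1,
        _filedata => 1,
        _sequence => 1,
    );
    # Same body as the parent's version, but closed over this hash.
    sub _all_attributes { return sort keys %_attribute_properties }
}

package DemoLazyChild;
use base 'DemoParent';
# No override: the inherited closure still sees only the parent's hash.

package main;

print DemoParent->new->describe, "\n";
print DemoChild->new->describe, "\n";
print DemoLazyChild->new->describe, "\n";
```

DemoChild reports its extra _sequence attribute only because it supplies its own copy of _all_attributes; DemoLazyChild, which relies on inheritance alone, reports the parent's attributes.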
Several fields are added to specialize a FileIO into a SeqFileIO. First, there
is an array, _seqfileformats, explicitly listing the file formats the SeqFileIO
class can recognize; in other words, the types of biological sequence data the
class can handle. For each file format, x, SeqFileIO provides three methods:
is_x, parse_x and put_x.
is_x methods check whether the contents of filedata correspond to the format.
The tests are only approximations, but they do a pretty good job and are much
faster than running a complete recursive descent parser on the data. One nifty
thing about these methods is their use of the simple statement return; to
provide a return value. This is a standard Perl trick because it returns undef
in a scalar context and the empty list in a list context. Either way, the
result is effectively "false".

parse_x methods attempt to strip out the actual sequence information from
filedata, possibly along with some other useful information.

put_x methods attempt to return an array of strings that would correspond to
the contents of filedata if written into a file using the appropriate syntax
for the specified file format, x. The assumption is that the caller will write
this array into a file.

We can initialize a SeqFileIO object from a Genbank file, such as record.gb, via the code:
my $genbank = SeqFileIO->new();
$genbank->read( filename => 'record.gb' );
Then we can fiddle with these methods by debugging testSeqFileIO.
Try it: set a breakpoint just beyond where the code above appears in testSeqFileIO.
Go ahead and experiment with $genbank. Here's a sample session:
main::(testSeqFileIO:72):  print "\n####################\n####################\n####################\n";
  DB<3> print $genbank->get_header
AB031069 2487 bp mRNA PRI 27-MAY-2000 Sequence severely truncated for demonstration. AB031069
  DB<4> print $genbank->get_sequence
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGC
GCAATAGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCG
CCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGAT
AAAAAAAAAAAAAAAAAAAAAAAAAAA
  DB<5> @matt = $genbank->put_genbank
  DB<6> p "@matt"
LOCUS AB031069 267 bp
DEFINITION AB031069 2487 bp mRNA PRI 27-MAY-2000 Sequence severely truncated for demonstration. AB031069 , 267 bases, 829 sum.
ACCESSION AB031069
ORIGIN
   1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
  61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
 121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
 181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
 241 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//
  DB<7>
In the above session, note that the header field holds a lot less information than is provided in the annotation section of record.gb! This indicates that information can be lost if these methods are used to translate a file from one format to another. Indeed, in the example above, I am translating from one format to the same format. While the output generated does conform to the syntax of a Genbank file, it is missing many of the details present in the annotation (i.e., the "header") of the original file.
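The ORIGIN block in that session has a regular layout: lowercase bases, 60 per line in groups of 10, each line prefixed with the position of its first base. As a rough sketch of how a put_ method might build such lines, consider the hypothetical helper below. It is not the actual put_genbank, and the 9-column width for the position is an assumption about the layout.

```perl
use strict;
use warnings;

# Hypothetical helper (not the actual put_genbank): format a sequence as
# the lines of a Genbank-style ORIGIN block.
sub origin_lines {
    my ($sequence) = @_;
    my @lines = ("ORIGIN\n");
    my $seq   = lc $sequence;

    # Walk the sequence 60 bases at a time.
    for (my $pos = 0; $pos < length($seq); $pos += 60) {
        my $chunk = substr($seq, $pos, 60);
        # Split the chunk into groups of at most 10 bases.
        my @groups = $chunk =~ /(.{1,10})/g;
        # Positions are 1-based and right-justified (assumed width: 9).
        push @lines, sprintf("%9d %s\n", $pos + 1, join(' ', @groups));
    }
    push @lines, "//\n";
    return @lines;
}

# A 70-base example: one full line of 60 plus a short second line.
my @block = origin_lines('AGATGGCGGC' x 6 . 'AGACGACGCA');
print @block;
```

Like the put_ methods described above, the helper returns an array of strings and leaves it to the caller to write them to a file.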
It is harder to play with the various parse_ methods because they are invoked only indirectly by read. The idea is to first load the contents of a sequence file into the SeqFileIO object, then use the is_ methods to determine the file format of the contents of the file. The corresponding parse_ method is then used to initialize the attribute fields (such as sequence, header, id and accession) of the SeqFileIO object. The sequence data could then be manipulated, and a put_ method could be used to store it in a file in whatever format we wanted.
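The is_ methods rely on a bare return; to signal "false". The sketch below shows why that is safer than return undef. Both subroutines are hypothetical and test only for a FASTA-style first line; they are not taken from SeqFileIO.

```perl
use strict;
use warnings;

# Hypothetical format test in the style of the is_ methods: a cheap
# check of the first line rather than a full parse.
sub looks_like_fasta {
    my (@lines) = @_;
    return 1 if @lines && $lines[0] =~ /^>/;
    return;    # undef in scalar context, empty list in list context
}

# Contrast: 'return undef' yields a one-element list in list context,
# and a one-element array counts as true when tested.
sub looks_like_fasta_buggy {
    my (@lines) = @_;
    return 1 if @lines && $lines[0] =~ /^>/;
    return undef;
}

my @not_fasta = ("LOCUS AB031069\n");

my $scalar_result = looks_like_fasta(@not_fasta);
my @list_result   = looks_like_fasta(@not_fasta);
my @buggy_result  = looks_like_fasta_buggy(@not_fasta);

print defined $scalar_result ? "defined\n" : "undef\n";
print scalar(@list_result), "\n";     # number of elements returned
print scalar(@buggy_result), "\n";    # the buggy version returns one
```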
The read() method is particularly clever. Notice how it goes about invoking the parse_ method that is appropriate to the file format with the code:
$self->{'_format'} = $self->isformat;
my $parsemethod = 'parse_' . $self->{'_format'};
$self->$parsemethod;
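To experiment with this dispatch technique outside the full class, here is a small self-contained sketch. The DemoSeqReader package, its two toy formats (fasta and raw) and the parse_unknown fallback are all hypothetical; it assumes method names of the form is_x and parse_x and an isformat method that returns the bare format name.

```perl
use strict;
use warnings;

package DemoSeqReader;

# Hypothetical list of supported formats, tried in order.
my @formats = qw(fasta raw);

sub new {
    my ($class, @lines) = @_;
    return bless { _filedata => [@lines] }, $class;
}

sub is_fasta {
    my ($self) = @_;
    return 1 if $self->{_filedata}[0] =~ /^>/;
    return;
}

sub is_raw {
    my ($self) = @_;
    return 1 if join('', @{ $self->{_filedata} }) =~ /^[ACGTNacgtn\s]+$/;
    return;
}

sub parse_fasta {
    my ($self) = @_;
    my ($header, @rest) = @{ $self->{_filedata} };
    chomp($header, @rest);
    $self->{_header}   = substr($header, 1);    # drop the leading '>'
    $self->{_sequence} = join '', @rest;
    return;
}

sub parse_raw {
    my ($self) = @_;
    ($self->{_sequence} = join '', @{ $self->{_filedata} }) =~ s/\s+//g;
    $self->{_header} = '';
    return;
}

# Hypothetical fallback so that an unrecognized format fails clearly.
sub parse_unknown { die "Unrecognized sequence format\n" }

# Return the first format whose is_ test passes.
sub isformat {
    my ($self) = @_;
    for my $format (@formats) {
        my $is_method = "is_$format";
        return $format if $self->$is_method;
    }
    return 'unknown';
}

sub read {
    my ($self) = @_;
    $self->{_format} = $self->isformat;
    # A string holding a method name can be used after the arrow.
    my $parsemethod = 'parse_' . $self->{_format};
    $self->$parsemethod;
    return $self;
}

package main;

my $reader = DemoSeqReader->new(">demo sequence\n", "ACGT\n", "TTGA\n")->read;
print "$reader->{_format}\n";
print "$reader->{_header}\n";
print "$reader->{_sequence}\n";
```

Under this design, supporting another format means adding its name to the list and defining its is_ and parse_ methods; read itself does not change.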
I urge you to use the debugger to step through testSeqFileIO, the test program for SeqFileIO.