We have examined how to create data structures in Perl using references, hashes and arrays in various combinations. These techniques should seem extremely cumbersome to a computer scientist accustomed to modern programming languages. There are no named fields, no type-checking, etc. In effect, the entire burden of ensuring that the data structures are defined and used correctly falls upon the programmer. The Perl compiler provides little if any aid. Luckily, recent versions of Perl include object-oriented concepts such as classes and methods. The syntax for these structures borrows much from other OO languages such as C++ and Java.
The bioperl module defines a number of classes that are useful in Bioinformatics programming, so we'll use many of these classes as examples here.
use Bio::Seq;
use Bio::SeqIO;
Though these look like other package inclusions we've seen, we're actually accessing the classes Seq and SeqIO. Classes are implemented as packages. An object is a reference to a data structure within the class. A method is a subroutine of the class. Most objects are implemented as hashes that are accessed with blessed references to the hashes. (Blessing an object means attaching its class's name to that object.) Each field of an object is called an attribute and is assigned a value. This is accomplished via a hash of the field names to their corresponding values. Here is some sample code:
$bob = TRNA->new();
$bob->findloops();
The first line is interesting because TRNA is the name of a package/class, not a reference to anything. So how is the -> operator working? Because the value to the left of the arrow is a class name, Perl invokes the subroutine (method) new in that class. This method returns a reference to an object (a blessed hash, actually), which is stored in $bob. The argument list to new consists first of the name of the class, "TRNA", followed by whatever arguments are explicitly provided.
In the second line, Perl knows which subroutine to call because the $bob object has been blessed with the identity of its class, TRNA. (As with other OO languages, the object itself, here the reference held in $bob, is implicitly passed as the first argument to findloops().)
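To make the two call forms concrete, here is a minimal sketch. The body of the TRNA class below is invented purely for illustration (the real class is not shown in these notes); the sequence argument and the string returned by findloops are assumptions.

```perl
use strict;
use warnings;

{
    package TRNA;    # hypothetical stand-in; the real TRNA class is not shown here

    sub new {
        my ($class, %arg) = @_;          # $class is the string "TRNA"
        my $self = { _sequence => $arg{sequence} || '' };
        return bless $self, $class;      # mark the hash as belonging to TRNA
    }

    sub findloops {
        my ($self) = @_;                 # $self is the object the method was called on
        return "looking for loops in $self->{_sequence}";
    }
}

my $bob = TRNA->new(sequence => 'GCGGAUUUAGCUC');

# These two calls do the same work; the arrow form supplies $bob implicitly.
my $via_arrow = $bob->findloops();
my $via_sub   = TRNA::findloops($bob);

print "$via_arrow\n";
print ref($bob), "\n";    # the class name attached by bless
```

The direct call TRNA::findloops($bob) is shown only to expose what the arrow does; in real code you would use the arrow form.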
Here is a sample module that defines a new class, Gene1. Here is a sample program that makes use of that class:
use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use Gene1;

print "Object 1:\n\n";
my $obj1 = Gene1->new(
    name       => "Aging",
    organism   => "Homo sapiens",
    chromosome => "23",
    pdbref     => "pdb9999.ent"
);
print $obj1->name, "\n";
print $obj1->organism, "\n";
print $obj1->chromosome, "\n";
print $obj1->pdbref, "\n";

print "Object 2:\n\n";
my $obj2 = Gene1->new(
    organism => "Homo sapiens",
    name     => "Aging",
);
print $obj2->name, "\n";
print $obj2->organism, "\n";
print $obj2->chromosome, "\n";
print $obj2->pdbref, "\n";

print "Object 3:\n\n";
my $obj3 = Gene1->new(
    organism   => "Homo sapiens",
    chromosome => "23",
    pdbref     => "pdb9999.ent"
);
print $obj3->name, "\n";
print $obj3->organism, "\n";
print $obj3->chromosome, "\n";
print $obj3->pdbref, "\n";
Take a close look at the new() method in the Gene1 class:
sub new {
    my ($class, %arg) = @_;
    return bless {
        _name       => $arg{name}       || croak("no name"),
        _organism   => $arg{organism}   || croak("no organism"),
        _chromosome => $arg{chromosome} || "????",
        _pdbref     => $arg{pdbref}     || "????",
    }, $class;
}
First, you can see that the first parameter is stripped out and is the name of the class, in this case "Gene1". But the first line also sets the value of %arg. Recall that parameters are passed as a single array (or list). Thus, %arg is set equal to all the parameters following the first. If you look up at the invocation of new(), you'll see something like organism => "Homo sapiens", chromosome => "23", ... Notice that the => operators do not appear within the anonymous hash constructor, {}. The => operator is really just shorthand for a comma that also quotes its left operand when that operand is a bare identifier. Thus, a => b is equivalent to "a", b.
So, the assignment to %arg actually creates a hash from the array/list of values
passed to new(). The => syntax is used in the invocation to emphasize that
we're using named parameters rather than positional parameters. (Lisp is another
language that provides for both positional and named parameters.)
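The equivalence of => and a quoted comma list is easy to check. In this minimal sketch, show_args is a hypothetical subroutine (not part of Gene1) that collects named parameters the same way new() does.

```perl
use strict;
use warnings;

# A fat-comma list and a plain quoted list build the same hash.
my %with_arrows = (organism => "Homo sapiens", chromosome => "23");
my %with_commas = ("organism", "Homo sapiens", "chromosome", "23");

# show_args is a hypothetical subroutine: the flat argument list it
# receives in @_ is turned into a hash, just as in new().
sub show_args {
    my (%arg) = @_;
    return join ", ", map { "$_=$arg{$_}" } sort keys %arg;
}

my $summary = show_args(organism => "Homo sapiens", chromosome => "23");
print "$summary\n";
```

Because the arguments end up in a hash, the caller may list them in any order, which is the point of named parameters.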
Now notice that new() is returning a reference to an anonymous hash, whose keys have been conventionally defined as being almost the same as the "instance fields" we are defining for the class (the prepending of a '_' is a convention for distinguishing identifiers that are known only internally to a package). The logical or operator '||' is used to allow for short-circuit evaluation. If you examine the operation closely, you'll see that the eventual behavior is that the program will terminate if new() is called without providing at least a 'name' and 'organism' parameter, and that if the other parameters are not provided, they will be defined as having an initial value of "????". Note that '||' tests for truth rather than definedness, so a false value such as "0" is treated the same as a missing parameter.
Note that when the {} operator is being used to access a hash value, the argument need not be quoted. Thus, $arg{name} is identical to $arg{'name'}.
The last line of new() invokes the bless operator. bless() returns its first argument, a reference, but that reference will have been marked with the name of the class, which is its second argument. This attached value can be accessed via the ref function, which should be invoked on a reference. The function returns a string indicating the type of reference (SCALAR, ARRAY, etc.). If a reference has been blessed into a package/class, then ref returns the name of that class. Try a few examples: make some references and use the ref function on them.
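Here is one way to carry out that experiment. This is just a sketch; the name "Gene1" is used only as a label for bless, and no Gene1 package needs to be loaded for it to run.

```perl
use strict;
use warnings;

my $scalar_ref = \"a string";
my $array_ref  = [1, 2, 3];
my $hash_ref   = { name => "Aging" };
my $code_ref   = sub { 1 };

# "Gene1" is only a label here; bless and ref do not require the
# package to exist.
my $object = bless { _name => "Aging" }, "Gene1";

my @kinds = map { ref($_) } ($scalar_ref, $array_ref, $hash_ref, $code_ref, $object);
print join(" ", @kinds), "\n";

# ref returns the empty string for something that is not a reference.
my $not_a_ref = ref("plain string");
```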
At the bottom of Gene1 we define several accessor functions:
sub name       { $_[0] -> {_name}       }
sub organism   { $_[0] -> {_organism}   }
sub chromosome { $_[0] -> {_chromosome} }
sub pdbref     { $_[0] -> {_pdbref}     }
When these functions are invoked, @_ will hold the argument list passed to each of them. Thus, $_[0] will be the value of the first parameter. We expect these functions to be invoked via the -> operator (as in, $bob->name), so the first argument should be an implicit reference to the object invoking the function (like the this reference in Java). In the accessors, then, the '->' operator is our old friend the dereferencing operator. As we have defined Gene1 objects to be implemented as hashes, the dereferenced parameter is a hash, and so we use the {} operator to access a value in the hash by using a key (either '_name', '_organism', etc.).
If I put that demonstration program code into file testGene1, I get this output from running that demonstration program with perl testGene1:
Object 1:

Aging
Homo sapiens
23
pdb9999.ent
Object 2:

Aging
Homo sapiens
????
????
Object 3:

no name at testGene line 35
Class variables can be defined in Perl via closures, another technique borrowed from languages such as Lisp. A closure is a subroutine that uses a variable defined outside the subroutine. For example:
{my $bob = 3; sub matt { $bob++; }}
Normally, a local variable (defined via my) would be allocated when its block was entered, and then freed when the block was exited. Here, though, $bob will persist because the compiler knows that there is still at least one valid reference to that value extant, namely in the subroutine matt. Consequently, $bob will persist across subsequent invocations of matt.
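You can watch $bob persist by calling matt a few times. A minimal sketch, using the same names as above; the only change is an explicit return.

```perl
use strict;
use warnings;

{
    my $bob = 3;                   # visible only inside this block
    sub matt { return $bob++; }    # matt keeps $bob alive after the block ends
}

# Each call sees the value left behind by the previous call.
my @seen = (matt(), matt(), matt());
print "@seen\n";
```

Outside the block there is no way to name $bob directly; under use strict, mentioning it there is a compile-time error. Only matt can read or change it.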
We use this trick to define class variables, such as we have in C++ and Java—variables that exist for the class, rather than for each instance of the class. The class variable $_count is defined in the class Gene2 in Gene2.pm:
{
    my $_count = 0;
    sub get_count   { $_count;   }
    sub _incr_count { ++$_count; }
    sub _decr_count { --$_count; }
}
We define three functions within this closure. All of them will be able to access the persistent variable $_count. The names of _incr_count and _decr_count are prefixed with an underscore to indicate (by convention) that these methods are only intended to be called from within this package/class.
The $_count variable will be used to keep track of how many instances of this class are extant. Note the invocation of _incr_count in the new() constructor.
The Gene2 class also defines mutator functions, one for each attribute (instance variable) of the class. Let's look at just one of them:
sub set_name { my ($self, $name) = @_; $self -> {_name} = $name if $name; }
The assumption is that this method would be invoked by means of something like: $bob->set_name("bob"); Note the clever use of the if statement modifier here. That line could also be written as
if ($name) { $self->{_name} = $name }
but the author was trying to show off a bit! The Gene2 class is tested in testGene2.
The third version of the Gene class, Gene3, introduces the AUTOLOAD system of Perl. When used, this provides for a user-defined subroutine named AUTOLOAD to be invoked whenever a program tries to invoke a function that is not already defined. At that time, the global variable $AUTOLOAD will be set to the fully qualified name of the offending subroutine (for example, "Gene3::get_name"). To provide for a global variable, we use the declaration 'our $AUTOLOAD' at the top of the file so that 'use strict' won't complain when it sees our AUTOLOAD subroutine using that (global) variable.
The most complicated code in Gene3 is that involving AUTOLOAD. The subroutine is used to define the accessor and mutator functions for our class only as they are invoked for the first time. Before discussing the AUTOLOAD method in detail we first review closures, and, in particular, how anonymous subroutines work with closures. (We talked about references to subroutines in an earlier lecture.) I've borrowed the following text from perldoc perlref:
A reference to an anonymous subroutine can be created by using sub without a subname:
$coderef = sub { print "Boink!\n" };
Note the semicolon. Except for the code inside not being immediately executed, a sub {} is not so much a declaration as it is an operator, like do{} or eval{}. (However, no matter how many times you execute that particular line (unless you're in an eval("...")), $coderef will still have a reference to the same anonymous subroutine.)
Anonymous subroutines act as closures with respect to my() variables, that is, variables lexically visible within the current scope. Closure is a notion out of the Lisp world that says if you define an anonymous function in a particular lexical context, it pretends to run in that context even when it's called outside the context.
In human terms, it's a funny way of passing arguments to a subroutine when you define it as well as when you call it. It's useful for setting up little bits of code to run later, such as callbacks. You can even do object-oriented stuff with it, though Perl already provides a different mechanism to do that—see perlobj.
You might also think of closure as a way to write a subroutine template without using eval(). Here's a small example of how closures work:
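sub newprint {
    my $x = shift;
    return sub { my $y = shift; print "$x, $y!\n"; };
}
$h = newprint("Howdy");
$g = newprint("Greetings");

# Time passes...

&$h("world");
&$g("earthlings");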
This prints
Howdy, world!
Greetings, earthlings!
Note particularly that $x continues to refer to the value passed into newprint() despite "my $x" having gone out of scope by the time the anonymous subroutine runs. That's what a closure is all about.
This applies only to lexical variables, by the way. Dynamic variables continue to work as they have always worked.
Look again at the definition of newprint() above. Note that when the sub operator is invoked a variable named $x is already visible: it is a lexical variable because it has been declared with my. (Dynamic variables, also borrowed from Lisp, relate to dynamic scoping. This is a pretty ugly topic, so we'll just avoid it here. If you are really interested, read up on the local operator.) Now the behavior of the two subroutines defined by newprint might seem odd to you. How come the code didn't print "Greetings, world!" and "Greetings, earthlings!"? On the face of it, both the anonymous subroutines refer to the same variable, $x. How can $x appear to have two different values (i.e., "Howdy" and "Greetings")?
Each time newprint() is invoked, a new local variable named $x is created. Normally when a block is exited, such as the body of newprint(), the local variables there would be deallocated. Here, though, we are creating an anonymous subroutine, and thus a closure, within this block. Thus when newprint() returns, the $x in the newly minted anonymous subroutine continues to access the value of that variable ("Howdy"). The second time newprint() is invoked, another local variable named $x is created; that is, a new chunk of memory is allocated to hold its value, and the name $x in the second anonymous subroutine refers to that new memory. Thus the second anonymous subroutine exists in its own closure, and $x, there, refers to the string "Greetings".
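The same point can be made with closures that modify their captured variable, which shows that each closure really does own a separate copy. This is a minimal sketch; make_counter is a hypothetical helper, not something from perlref or the textbook.

```perl
use strict;
use warnings;

# make_counter is a hypothetical helper: each call creates a fresh $count,
# and the returned anonymous subroutine is the only code that can reach it.
sub make_counter {
    my $count = shift;
    return sub { return $count++ };
}

my $from_ten = make_counter(10);
my $from_one = make_counter(1);

# Advancing one counter has no effect on the other.
my @results = ($from_ten->(), $from_ten->(), $from_one->(), $from_ten->());
print "@results\n";
```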
By the way, if you look at our first examination of closures (as a mechanism for providing for class variables), you'll see that the three subroutines get_count, _incr_count and _decr_count are all defined in the same lexical block, and thus, $_count does refer to the same memory location in each of those subroutines. (The case is slightly different from newprint because named subroutines are being created, and those creations occur at compilation time rather than run time.)
Okay, now that we've covered closures and anonymous subroutines, let's look at the AUTOLOAD code. In particular, look at the code for creating the accessor functions:
# AUTOLOAD accessors
if($operation eq 'get') {
    # define subroutine
    *{$AUTOLOAD} = sub { shift->{$attribute} };
To understand the code, consider the first time get_name() is invoked on a Gene3 object. Because no such method has been defined for the class/package, AUTOLOAD is invoked. At the time this if statement is executed $AUTOLOAD will be the fully qualified method name, the string "Gene3::get_name", and $attribute is a lexical variable (defined by my) referring to the string "_name".
Now we see why we had the statement "no strict 'refs'" at the top
of this section. The code *{$AUTOLOAD} is a symbolic reference because $AUTOLOAD
is a string. Normally, because we have "use strict" in our programs,
symbolic references are disallowed. The "no strict 'refs'" statement
relaxes this restriction.
This is the first time we've seen the '*' sigil, which denotes a typeglob. It refers to the symbol table entry for a name. The symbol table holds all the values that can be referenced by a given string, including a subroutine, a scalar, a hash table, an array, etc. In effect, each string in the symbol table has several slots (attributes) associated with it. Because we are assigning a value to the symbol table entry, and that value is a reference to a subroutine, we are setting the subroutine that will be invoked when we subsequently use "get_name" as a subroutine name (for example, in $bob->get_name()). If the code had instead been:
*{"get_name"} = \"bob";
This would change the scalar value that "get_name" accesses. Subsequent execution of the code print $get_name would output "bob". Here's a small Perl session involving manipulation of the symbol table:
  DB<9> *{"xyz"} = \"bob"
  DB<10> print $xyz
bob
  DB<11> print @xyz

  DB<12> *{"xyz"} = [1,2,3]
  DB<13> print "@xyz"
1 2 3
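The same experiment can be run as a script instead of in the debugger. This is a sketch under use strict, so the package variables are declared with our and the symbolic references are wrapped in no strict 'refs'; the name xyz is arbitrary.

```perl
use strict;
use warnings;

our ($xyz, @xyz);    # declare the package variables so strict accepts them

{
    no strict 'refs';                         # allow a string to name a symbol
    *{"main::xyz"} = \"bob";                  # fills only the scalar slot
    *{"main::xyz"} = [1, 2, 3];               # fills only the array slot
    *{"main::xyz"} = sub { "called xyz" };    # fills only the subroutine slot
}

my $result = xyz();    # the subroutine installed above
print "$xyz\n";
print "@xyz\n";
print "$result\n";
```

Each assignment touches only the slot that matches the type of the reference, so the scalar, the array and the subroutine named xyz coexist.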
Now, let's return to the subroutine created in the AUTOLOAD code:
*{$AUTOLOAD} = sub { shift->{$attribute} };
During creation of the anonymous subroutine, $attribute is lexically visible, so, in our example, the resulting subroutine exists within a closure in which the variable $attribute refers to the string "_name". The syntax of the defined subroutine is a bit weird. shift is not the name of a class, but rather the standard shift function. Because no argument is given, the usual implicit argument @_ is understood, so shift, all by itself, is the same as shift @_ and simply removes and returns the first argument to the subroutine. Thus, the subroutine might also be defined as:
sub { my $self = $_[0]; $self->{$attribute} };
In other words, the subroutine accesses its first parameter (it should be a reference to a hash) and accesses the value mapped there to the key that is in $attribute. (If you were really paying attention to the discussion of anonymous subroutines in closures, above, you might note that @_ is not a lexical variable, and hence is not part of the closure here.)
Continuing our example, from now on get_name() will refer to this subroutine. When it is executed, because $attribute is in a closure, and no other subroutines can access this variable, it will always have the value that it did when the subroutine was created, namely "_name". Thus when $bob->get_name() is next executed, it will return the value mapped to the key _name in the hash that $bob references. (Recall that we are defining our "objects" as hashes, and that when a method is invoked, a reference to the object, $bob here, becomes the first argument to the method. Thus that shift, above, evaluates to $bob, the reference to the object.)
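To see the whole mechanism run end to end, here is a minimal, self-contained sketch. GeneSketch is a hypothetical class written for this illustration, not the textbook's Gene3; it handles only get_ methods, and the $autoload_calls counter exists only to show that AUTOLOAD runs once per method name.

```perl
use strict;
use warnings;

{
    package GeneSketch;    # hypothetical class, not the textbook's Gene3
    use Carp qw(croak);

    our $AUTOLOAD;              # set by Perl to the fully qualified method name
    our $autoload_calls = 0;    # bookkeeping for this demo only

    sub new {
        my ($class, %arg) = @_;
        return bless { _name => $arg{name}, _organism => $arg{organism} }, $class;
    }

    sub AUTOLOAD {
        my ($self) = @_;
        return if $AUTOLOAD =~ /::DESTROY$/;    # do not autoload the destructor
        $autoload_calls++;

        # Accept only names of the form get_<attribute>.
        my ($attribute) = $AUTOLOAD =~ /::get(_\w+)$/
            or croak "No such method: $AUTOLOAD";
        croak "No such attribute: $attribute" unless exists $self->{$attribute};

        {
            no strict 'refs';    # $AUTOLOAD is a string, so this is a symbolic reference
            *{$AUTOLOAD} = sub { my $obj = shift; return $obj->{$attribute} };
        }
        return $self->{$attribute};
    }
}

my $gene   = GeneSketch->new(name => "Aging", organism => "Homo sapiens");
my $first  = $gene->get_name;    # not defined yet, so AUTOLOAD runs and installs it
my $second = $gene->get_name;    # now an ordinary method call
print "$first $second $GeneSketch::autoload_calls\n";
```

The early return for DESTROY matters: without it, the destructor call Perl makes when an object goes away would also land in AUTOLOAD and be rejected as an unknown method.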
The AUTOLOAD code that defines the mutators is similar:
*{$AUTOLOAD} = sub { shift->{$attribute} = shift; };
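Here there are two shifts, and the code depends on the order in which they run: the left-hand shift must take the object and the right-hand shift the new value. Perl does not document an evaluation order for the two sides of a scalar assignment, and perl typically evaluates the right-hand side first, in which case the two arguments are swapped. A form that does not depend on evaluation order is sub { my ($self, $value) = @_; $self->{$attribute} = $value; }.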
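A mutator can be written so that it does not depend on operand evaluation order at all, by unpacking @_ before doing the assignment. A minimal sketch follows; make_setter is a hypothetical helper, and a plain hash stands in for a blessed object.

```perl
use strict;
use warnings;

# make_setter is a hypothetical helper that builds a mutator closure for
# one attribute; it unpacks @_ first instead of calling shift twice.
sub make_setter {
    my ($attribute) = @_;
    return sub {
        my ($self, $value) = @_;
        $self->{$attribute} = $value;
        return $self->{$attribute};
    };
}

my $gene     = { _name => "Aging" };    # a plain hash stands in for an object
my $set_name = make_setter("_name");
$set_name->($gene, "Longevity");
print "$gene->{_name}\n";
```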
The DESTROY subroutine is automatically invoked when the last reference to an object in the package goes away, for example when a local variable holding the only reference goes out of scope (returning from a function call or exiting a block). The only argument to DESTROY is a reference to the object being destroyed. In this case all we want to do is decrement our counter of extant Gene3 objects, so the code is very simple:
sub DESTROY { my($self) = @_; $self->_decr_count(); }
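Putting the class variable, the constructor and DESTROY together, here is a minimal sketch of instance counting. CountedGene is a hypothetical class modeled on the description in these notes; it is not the textbook's Gene2 or Gene3.

```perl
use strict;
use warnings;

{
    package CountedGene;    # hypothetical class modeled on the description above

    {
        my $_count = 0;                      # class variable shared by the subs below
        sub get_count   { return $_count }
        sub _incr_count { return ++$_count }
        sub _decr_count { return --$_count }
    }

    sub new {
        my ($class, %arg) = @_;
        my $self = bless { _name => $arg{name} }, $class;
        $class->_incr_count();               # one more object exists
        return $self;
    }

    sub DESTROY {
        my ($self) = @_;
        $self->_decr_count();                # one fewer object exists
    }
}

my $outer = CountedGene->new(name => "Aging");
my $inside;
{
    my $inner = CountedGene->new(name => "Growth");
    $inside = CountedGene->get_count();      # two objects exist here
}                                            # $inner's object is destroyed here
my $after = CountedGene->get_count();
print "$inside $after\n";
```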
Here is the code for testing Gene3, testGene3.
The final class presented in chapter 3 of the textbook is Gene, in the file Gene.pm. Most of the techniques introduced in this code are explained quite well in the text, so I'll just hit the high points.
Probably the biggest change between Gene3 and Gene is that the new class provides a hash, called %_attribute_properties, that explicitly lists the names of the instance variables in the class. In the previous Gene classes, the names were provided as keys in the anonymous hash that was put in $self. (See the constructor in Gene3.) With these old classes, we could get the names of the instance variables by using the keys function on any given object. By making %_attribute_properties a class variable, though, the list of key names can be accessed even without having an object. (For example, when we are creating the first instance of the class.) Here is the code from Gene.pm:
my %_attribute_properties = (
    _name       => [ '????', 'read.required'],
    _organism   => [ '????', 'read.required'],
    _chromosome => [ '????', 'read.write'],
    _pdbref     => [ '????', 'read.write'],
    _author     => [ '????', 'read.write'],
    _date       => [ '????', 'read.write']
);
Pay particular attention to how this structure is used via the _all_attributes() method in the new() constructor and in clone(). This code also provides a more structured method of defining default values for each instance variable. Here they are all '????', but you could put anything there. The 'required' keyword in the code indicates that the constructor requires an initial value for a field, and in that case, any default value would be ignored. The 'read' and 'write' keywords are used by the accessor and mutator functions to determine access privileges to the various fields. (See how the internal method _permissions() is used.)
The clone() method defined here is a bit different from similar methods in other languages, like C++ and Java. In particular, this method allows for the user to provide field values that will override those that would normally be cloned. See the use of clone() near $obj3 in the code for testing the Gene class. This makes the usage of clone() very similar to that of new().
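The idea of a clone that accepts overrides can be sketched in a few lines. CloneSketch is a hypothetical class for illustration only; the textbook's Gene.pm works through %_attribute_properties and differs in detail.

```perl
use strict;
use warnings;

{
    package CloneSketch;    # hypothetical class; the textbook's Gene.pm differs in detail

    sub new {
        my ($class, %arg) = @_;
        return bless { _name => $arg{name}, _organism => $arg{organism} }, $class;
    }

    # Copy every attribute from the caller, then let named arguments override.
    sub clone {
        my ($caller, %arg) = @_;
        my $copy = bless( { %$caller }, ref($caller) );    # shallow copy of the hash
        $copy->{"_$_"} = $arg{$_} for keys %arg;
        return $copy;
    }
}

my $original = CloneSketch->new(name => "biggene", organism => "Mus musculus");
my $copy     = $original->clone(name => "biggeneclone");
print "$copy->{_name} / $copy->{_organism}\n";
print "$original->{_name}\n";
```

The copy is shallow: if an attribute held a reference, the original and the clone would share the data it points to.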
Perl Documentation
Perl has a system, called POD or plain old documentation, for providing external documentation that is similar to Java's javadoc tool. The programmer embeds the documentation within the .pl or .pm file, and a reader accesses this documentation via the perldoc tool. So to see the documentation for the Gene class, make sure you're in the same directory as Gene.pm and use the command
perldoc Gene.pm
POD is very simple. It interprets everything between the keywords =head1 and =cut as being documentation. Btw, the Perl compiler always ignores everything between these keywords. It is up to the programmer to ensure that the documentation conforms to the accepted conventions. The easiest way to do this is to look at the perldoc for other well-known modules and follow their lead. Here is the result of running perldoc Gene.pm:
Gene
    Gene: objects for Genes with a minimum set of attributes

Synopsis
    use Gene;

    my $gene1 = Gene->new(
        name       => 'biggene',
        organism   => 'Mus musculus',
        chromosome => '2p',
        pdbref     => 'pdb5775.ent',
        author     => 'L.G.Jeho',
        date       => 'August 23, 1989',
    );

    print "Gene name is ", $gene1->get_name();
    print "Gene organism is ", $gene1->get_organism();
    print "Gene chromosome is ", $gene1->get_chromosome();
    print "Gene pdbref is ", $gene1->get_pdbref();
    print "Gene author is ", $gene1->get_author();
    print "Gene date is ", $gene1->get_date();

    $clone = $gene1->clone(name => 'biggeneclone');

    $gene1-> set_chromosome('2q');
    $gene1-> set_pdbref('pdb7557.ent');
    $gene1-> set_author('G.Mendel');
    $gene1-> set_date('May 25, 1865');

    $clone->citation('T.Morgan', 'October 3, 1912');
    print "Clone citation is ", $clone->citation;

AUTHOR
    A kind reader

COPYRIGHT
    Copyright (c) 2003, We Own Gene, Inc.
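For comparison with that output, here is a minimal sketch of what POD looks like inside a source file. PodSketch is a hypothetical module name invented for this example; the POD directives must start in the first column.

```perl
package PodSketch;    # hypothetical module name, for illustration only
use strict;
use warnings;

=head1 NAME

PodSketch - a tiny module showing where POD goes

=head1 SYNOPSIS

    use PodSketch;
    print PodSketch::greeting();

=cut

# Everything between =head1 and =cut above is skipped by the compiler,
# so this subroutine is the first real code after the use lines.
sub greeting { return "hello from PodSketch\n" }

print greeting();

1;
```

If this were saved as PodSketch.pm, running perldoc PodSketch.pm in the same directory would display the NAME and SYNOPSIS sections.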