Now that we've got the basics of Perl under our belts, it's time to move on to more advanced topics. To begin with, let's look a bit more into Perl modules and packages. Recall that we can define namespaces via use of the "package" statement. This is similar to the way package statements work in Java and Lisp and namespace statements work in C++. Programs can use multiple variables and subroutines with the same name provided that each is defined within a unique namespace, i.e., within a unique package. Thus the code
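package Matt;
our $address = 1234;
package Bob;
our $address = 2345;
is completely legal. Two variables called "$address" are defined, one in the Matt package and one in the Bob package. Note the use of "our" rather than "my": a variable declared with "my" is lexically scoped and belongs to no package at all, so only "our" (package) variables are separated by the package statement.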
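As a minimal sketch of the idea (the names and values are only illustrative), package variables declared with our can be reached from anywhere in the program through their fully qualified names:

```perl
use strict;
use warnings;

package Matt;
our $address = 1234;    # package variable; its full name is $Matt::address

package Bob;
our $address = 2345;    # a different variable; its full name is $Bob::address

package main;

# Each variable is reached through its package-qualified name.
print "Matt lives at $Matt::address\n";
print "Bob lives at $Bob::address\n";
```

The switch back to package main is there only to show that the two variables stay distinct when seen from a third namespace.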
In general, programs are composed of subroutines from many different files. This becomes more and more the case as you develop code--you will find new uses for old subroutines. To facilitate this re-use, good software engineers define modules. Each module is a single file, named after the module, with a name ending in ".pm", and which has a last line containing just
1;
Usually, you should place the code in a module within a package. Thus the first line of the file will usually be something like:
package moduleName;
As an example, consider the module BeginPerlBioinfo we've been using so far. The module is stored in the file BeginPerlBioinfo.pm, and the contents of that file should be ("should be" because Tisdall avoided using the package statement in his first book) something like:
package BeginPerlBioinfo;
############################################################
#
# BeginPerlBioinfo.pm
# - a library of subroutines
#   from the examples and text in the book:
#
# Beginning Perl for Bioinformatics
# by James Tisdall
#
....body of the file omitted here....

    # Finally, calculate and return the translation
    return dna2peptide ( substr ( $seq, $start - 1, $end -$start + 1) );
}

1;
For a program to make use of the subroutines within a module, use the use statement:
use moduleName;
Perl locates modules by searching for a .pm file with the given name. The question is: where does Perl look? One way to tell it is to specify the directory where the module(s) are stored via a command line -I argument, such as:
perl -I../src bob.pl
Now if bob.pl contained the line
use MattsModule;
Perl would look in the directory "../src" (as well as in some other directories) for the file MattsModule.pm. As to where else Perl would look, check out the built-in array @INC. This specifies the search path for Perl modules. I use the Perl debugger to print the contents of the array on my machine:
$ perl -d
Loading DB routines from perl5db.pl version 1.19
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
DB<1> print join("\n", @INC), "\n";
/usr/lib/perl5/5.8.0/cygwin-multi-64int
/usr/lib/perl5/5.8.0
/usr/lib/perl5/site_perl/5.8.0/cygwin-multi-64int
/usr/lib/perl5/site_perl/5.8.0
/usr/lib/perl5/site_perl
.
DB<2>
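The directory-by-directory scan of @INC can be imitated in a few lines. The following is only a sketch of the idea, not Perl's actual implementation; the subroutine name find_module_file is made up for this example, and File::Spec is used as the target simply because it ships with Perl.

```perl
use strict;
use warnings;
use File::Spec;

# Return the first file in @INC that matches the module name, or nothing.
sub find_module_file {
    my ($module) = @_;

    # Some::Module corresponds to the relative path Some/Module.pm
    my @parts = split /::/, $module;
    $parts[-1] .= '.pm';

    for my $dir (@INC) {
        next if ref $dir;    # @INC may also hold code references; skip them
        my $candidate = File::Spec->catfile($dir, @parts);
        return $candidate if -f $candidate;
    }
    return;
}

my $found = find_module_file('File::Spec');
print defined $found ? "File::Spec found at $found\n"
                     : "File::Spec not found\n";
```

The first match wins, which is why the order of the directories in @INC matters.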
When it needs to find a module, Perl scans through these directories, one after the other, looking for the necessary .pm file. The last entry in the list is ".", which is just the current working directory, so Perl also searches the directory from which you launched the program (which is not necessarily the directory containing the Perl file you are running). Be aware that, starting with Perl 5.26, "." is no longer included in @INC by default. Another way to add to this search path is via a use lib statement in your code, such as:
use lib "C:/Documents and Settings/Matt Evett/src/perl/";
This instruction adds another directory to the front of the search path at compile time. The change applies to the whole program, not just to the file containing this line. You can also modify the search path through environment variables such as PERL5LIB. For full documentation (as usual) see the perlmod and perlmodlib documentation at www.perldoc.org.
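To see the search path at work without touching your real directories, here is a small self-contained sketch. It writes a throwaway module into a temporary directory, adds that directory to @INC, and loads the module. The module name DemoModule and its greet subroutine are invented for this example. Because the directory is only known at run time, the sketch uses unshift and require, which are the run-time counterparts of use lib and use.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Create a temporary directory and write a tiny module file into it.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'DemoModule.pm');

open(my $fh, '>', $path) or die "Cannot write $path: $!";
print {$fh} <<'END_MODULE';
package DemoModule;
use strict;
use warnings;

sub greet { return "hello from DemoModule" }

1;
END_MODULE
close($fh) or die "Cannot close $path: $!";

# Put the directory at the front of the search path, then load the module.
unshift @INC, $dir;
require DemoModule;

my $greeting = DemoModule::greet();
print "$greeting\n";

# %INC records which file each loaded module came from.
print "Loaded from: $INC{'DemoModule.pm'}\n";
```

Note that the generated file follows the rules given earlier: it is named after the module, ends in ".pm", starts with a package statement, and ends with 1;.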
Okay, so let's take a look at a module in use by looking at Tisdall's testGeneticcode1. (He omits a ".pl" suffix, probably because he is mostly working in Unix, and Unix does not require filetype suffixes. This seems more lazy than anything else. You should always include the .pl suffix for your own file names.) This file uses the Geneticcode1 module.
When the line
use Geneticcode1;
is executed in testGeneticcode1, the code in Geneticcode1.pm is evaluated, thus defining the hash %genetic_code. This happens only once, so repeated calls to the code2aa subroutine (which we saw in an earlier lecture) are very fast: the hash need not be rebuilt with each invocation. Because the hash is declared with my, its scope is limited to the Geneticcode1.pm file. Subroutines, by contrast, are not lexically scoped, so code2aa is visible to programs using the module. This is an example of encapsulation, a trait much desired by respectable software engineers!
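The same pattern can be sketched in a single file. This is not Tisdall's Geneticcode1 code: the package name DemoCode, the subroutine codon_to_aa, and the three-entry table are stand-ins chosen to keep the example short. A bare block plays the role of the module file here.

```perl
use strict;
use warnings;

{
    package DemoCode;

    # Lexical hash: built once, and visible only inside this block.
    my %genetic_code = (
        ATG => 'M',
        TGG => 'W',
        TAA => '_',
    );

    # Subroutines live in the package's symbol table, so outside
    # code can call this one even though it cannot see the hash.
    sub codon_to_aa {
        my ($codon) = @_;
        return $genetic_code{ uc $codon };
    }
}

print DemoCode::codon_to_aa('atg'), "\n";    # prints M
```

Outside the block there is no way to name %genetic_code directly; the only access is through the subroutine, which is the point of the encapsulation.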
The module Geneticcode2 is a modification of Geneticcode1, and contains the subroutines necessary for converting a sequence of dna into a peptide sequence. The SequenceIO module contains subroutines for extracting DNA sequences from FASTA files, and printing them as peptide sequences. (We've seen all these subroutines before.)
CPAN is an on-line library of zillions of Perl modules, all free for the borrowing. There is a search tool there for locating modules. Having found them, you can download and install them on your system. As an example, let's install the Statistics::ChiSquare module that Tisdall mentions in his book. This is a tool for determining the probability that a collection of numbers is "random" (meaning, in this case, evenly distributed).
Having run the search for "ChiSquare" on the CPAN web page you'll get a page containing a number of entries looking like this:
Statistics::ChiSquare
How well-distributed is your data?
Statistics-ChiSquare-0.5 - 16 Nov 2003 - David Cantrell
The top link is to the module's documentation. The bottom link is to the download page for the module. Go there and download the module's installation package. In this case, the downloaded file was misnamed "Statistics-ChiSquare-0.5.tar.gz.tar"; the second "tar" is spurious. Once I downloaded the file, I used Unix's "file" command to determine that it was actually a gzip file, so I changed the name to just "Statistics-ChiSquare-0.5.tar.gz". From there I unzipped, then untarred the file. A glance at the README file showed me what I had to do to finish the installation. A complete transcript of the installation can be found here.
If you are running Perl on a Linux or Unix system, there is an even easier way to install modules: you can use Perl's built-in CPAN.pm module. (Check out the documentation via perldoc CPAN). To do this, locate a module you want to install, say, "Statistics::ChiSquare". Then you use this command:
$ perl -MCPAN -e 'install Statistics::ChiSquare'
The -MCPAN argument instructs perl to use the CPAN module (which should've been included in your installation of Perl). The install subroutine of that module does the dirty work. Now, the first time you run the install subroutine, you have to configure CPAN. Here's a transcript of how I did that. If you look at the transcript, you'll see that everything went smoothly until the configuration wanted to find the lynx program. (Lynx is a simple text-only web browser.) At this point I actually started another cygwin window and downloaded and installed lynx. (Ask me about this if you're really interested. I don't think it is necessary for your CPAN to work, so long as you have an ftp executable on your system someplace.) If you don't have a component that CPAN wants during this configuration, just try hitting the enter key.
After the configuration, CPAN actually goes about downloading and installing the module. Check out the transcript.
Okay, now that we've installed the module (supposedly), let's see if we can use it. First, check to see if the module installation has made the documentation available. Try:
perldoc Statistics::ChiSquare
You should see documentation similar to this:
NAME
    "Statistics::ChiSquare" - How well-distributed is your data?

SYNOPSIS
        use Statistics::Chisquare;

        print chisquare(@array_of_numbers);

    Statistics::ChiSquare is available at a CPAN site near you.

DESCRIPTION
    Suppose you flip a coin 100 times, and it turns up heads 70 times. *Is
    the coin fair?*

    Suppose you roll a die 100 times, and it shows 30 sixes. *Is the die
    loaded?*

    In statistics, the chi-square test calculates how well a series of
    .....the rest is omitted for brevity....
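Another quick check is to ask Perl itself whether it can load the module. The sketch below uses only built-in features: require dies if the file cannot be found in @INC, and eval traps that failure so the program can report it politely. It runs whether or not the installation succeeded.

```perl
use strict;
use warnings;

my $module = 'Statistics::ChiSquare';

# Translate Some::Module into the relative path Some/Module.pm.
(my $file = "$module.pm") =~ s{::}{/}g;

# require dies if the file is not found; eval catches the failure.
my $installed = eval { require $file; 1 };

if ($installed) {
    print "$module loaded from $INC{$file}\n";
}
else {
    print "$module was not found in \@INC\n";
}
```

If the module is reported as missing even though the installation seemed to work, compare the install location shown in your transcript with the directories in @INC.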
Next, you might try running this testing program. It checks the degree of randomness of two things: the distribution of subway stops along the Broadway ("red") line in New York City, and the rolls of a virtual die as simulated by Perl's built-in random number generator. Execution of the program shows that the subway stops are most probably not randomly distributed (not surprising, given the stops were planned), and that (on my machine at least) the random number generator is not particularly "random".
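To get a feel for what such a test measures, here is a hand-rolled sketch of the chi-square statistic for counts that are expected to be evenly distributed. It is not the Statistics::ChiSquare implementation and does not use that module at all; the subroutine name chi_square_statistic and the die-roll counts are made up for illustration. The sketch stops at the raw statistic and does not turn it into a probability.

```perl
use strict;
use warnings;

# Sum of (observed - expected)^2 / expected, where every category
# is expected to receive the same share of the total.
sub chi_square_statistic {
    my @observed = @_;

    my $total = 0;
    $total += $_ for @observed;
    my $expected = $total / @observed;

    my $statistic = 0;
    $statistic += ($_ - $expected) ** 2 / $expected for @observed;
    return $statistic;
}

# Counts of each face over 60 hypothetical rolls of a die.
my @fair   = (10, 10, 10, 10, 10, 10);
my @loaded = ( 5,  5,  5,  5,  5, 35);

print "fair die:   ", chi_square_statistic(@fair),   "\n";    # prints 0
print "loaded die: ", chi_square_statistic(@loaded), "\n";    # prints 75
```

The larger the statistic, the further the counts are from an even spread. A full test then compares the statistic against the chi-square distribution for the appropriate number of degrees of freedom.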