Data Structures in Perl

In a previous lecture we discussed references, and how references can be used to pass arrays as parameters to subroutines, or to retrieve values from subroutines. We can also reference scalar constants such as:

$peptideref = \'EIQADEVRL';

print "Here is what's in the reference:\n";
print $peptideref, "\n";

print "Here is what the reference is pointing to:\n";
print ${$peptideref}, "\n";

Producing the output:

Here is what's in the reference:
SCALAR(0x80fe4a0)
Here is what the reference is pointing to:
EIQADEVRL

Printing the reference itself just prints the address at which the constant is stored. Because the scalar is not stored in a named variable, this is called an anonymous reference. This is not particularly useful with scalars, but we can also do this with data structures, and that will be useful.

Double-referencing (or "why people fear C")

It is possible to reference references, as well (or to reference references to references....) Such a double-reference is usually referred to as a "handle" in programming literature. To access the value at the end of such a string of references, you have to use multiple dereference operators. (Similar to the "*" dereference operator in C and C++.) Here is some sample code from Tisdall's book that does this:

$value = 'ACGAAGCT';
$refvalue = \$value;
$refrefvalue = \$refvalue;

print $value, "\n";
print $$refvalue, "\n";
print $$$refrefvalue, "\n";

This prints out:

ACGAAGCT
ACGAAGCT
ACGAAGCT

References to arrays, hashes and subroutines can also be dereferenced with the "->" syntax, borrowed from C. So if $mattarr is a reference to an array, then I can access the first element of that array via either $$mattarr[0], or via $mattarr->[0]. If $mattarr is a reference to a two-dimensional array (an array of arrays), we can use the arrow operator twice: $$mattarr[1][0], or $mattarr->[1][0], or $mattarr->[1]->[0]. The last two examples do the same thing because Perl doesn't provide actual 2-dimensional arrays, as will be explained in a section below.
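The equivalences just described can be checked with a minimal sketch. The name $mattarr comes from the text; the name $grid and the contents of both arrays are made up for illustration.

```perl
use strict;
use warnings;

# A reference to a flat array (contents are illustrative only).
my $mattarr = [ 'a', 'b', 'c' ];
my $first_sigil = $$mattarr[0];     # double-sigil form
my $first_arrow = $mattarr->[0];    # arrow form, same element

# A reference to an array of arrays ($grid is a hypothetical name).
my $grid = [ [ 1, 2 ], [ 3, 4 ] ];
my @same = ( $$grid[1][0], $grid->[1][0], $grid->[1]->[0] );

print "$first_sigil $first_arrow\n";   # prints "a a"
print "@same\n";                       # prints "3 3 3"
```

All three spellings in @same reach the same element, so which one to use is a matter of readability.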

Anonymous hashes and arrays

We can create anonymous references to hashes and arrays via similar syntax. Previously, we have defined references to hashes via syntax such as:

%mattshash = ('brilliant' => 'indeed', 'figure' => 'heavenly');
$ref = \%mattshash;
print "$$ref{'brilliant'}\n";

A reference to an anonymous hash looks like this:

$mattshash = {'brilliant' => 'indeed', 'figure' => 'heavenly'};
print "$$mattshash{'brilliant'}\n";

Note the use of curly-brackets instead of parentheses. We can create anonymous arrays the same way, using square brackets instead of parentheses when defining the array:

$mattsarr = ['brilliant', 'smart', 'funny'];
print "$$mattsarr[1]\n";
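Besides picking out single elements, we often want the whole hash or array behind an anonymous reference. The following is a small sketch of that, reusing the two structures above; the extra variable names ($hashKind, $arrKind, @keys, $length) are ours, not part of the original examples.

```perl
use strict;
use warnings;

my $mattshash = { 'brilliant' => 'indeed', 'figure' => 'heavenly' };
my $mattsarr  = [ 'brilliant', 'smart', 'funny' ];

my $hashKind = ref $mattshash;          # kind of thing the reference points to
my $arrKind  = ref $mattsarr;
my @keys     = sort keys %$mattshash;   # dereference the whole hash
my $length   = scalar @$mattsarr;       # dereference the whole array

print "$hashKind with keys: @keys\n";                             # HASH with keys: brilliant figure
print "$arrKind with $length elements, last is $mattsarr->[-1]\n"; # ARRAY with 3 elements, last is funny
```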

References to subroutines

These are similar to function handles or function pointers in other programming languages. Sometimes we would like to be able to pass a function as an argument to another function, such as the find subroutine we looked at in a previous lecture. This enables the called function to invoke the other function. To create a reference to a subroutine, we use the usual reference generating operator '\' in conjunction with the subroutine sigil '&'. (The '&' is optional in ordinary subroutine calls, although it was required in earlier Perls, but it is needed when taking a reference.) Thus, to create a reference to the subroutine code2aa we do this:

$subref = \&code2aa;

Then to invoke the subroutine we use the dereferencing operator:

&$subref('cgatcgatcgat');
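Perl accepts several spellings for calling through a subroutine reference. Here is a minimal sketch; the body of code2aa below is a hypothetical stand-in that merely returns the length of its argument, since the real subroutine is not shown here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for code2aa: it only reports the length of its
# argument; it does not translate codons.
sub code2aa {
    my ($dna) = @_;
    return length $dna;
}

my $subref = \&code2aa;

# Three equivalent ways to call through the reference.
my @results = (
    &$subref('cgatcgatcgat'),
    &{$subref}('cgatcgatcgat'),
    $subref->('cgatcgatcgat'),
);
print "@results\n";   # prints "12 12 12"
```

The arrow form, $subref->(...), is the one used in the rest of this lecture.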

Here is a sample piece of code from Tisdall's book that uses subroutine references. There are some errors in the code. Can you fix them?

While Tisdall's code illustrates the syntax of subroutine references, it doesn't present much of an argument for their use. Suppose you wanted to write some code that took an array of integers and printed the array that would result from incrementing each element of the original array by one. In addition, suppose you also wanted to write some code that printed all the values of an array decremented by one. It would be easy enough to write these two subroutines. But perhaps you might realize that applying a function to each element of an array is a common pattern in programming. Rather than writing both of these subroutines, we could write a single subroutine that took a reference to an array as one argument, and a reference to a subroutine as a second argument. This subroutine could then iterate across the members of the given array, applying the given subroutine to each. Here is code that would do that:

# Takes two arguments: reference to an array of integers, reference to 
# a subroutine to be applied to all elements in the array.
sub applyToAll {
    my ($arrayRef, $subRef) = @_;
    foreach (@$arrayRef) {
        $subRef->( $_);
    }
}

sub increment {
    my ($arg) = @_;
    print "Incremented value is ", $arg+1, "\n";
    return $arg+1;
}

sub decrement {
    my ($arg) = @_;
    print "Decremented value is ", $arg-1, "\n";
    return $arg-1;
}

my @arr = (21, 22, 23);
print "\nThe original array is @arr.\n\n";
print "First we'll try incrementing all of them:\n";
applyToAll ( \@arr, \&increment);
print "\nNow we'll try decrementing all of them:\n";
applyToAll ( \@arr, \&decrement);

When this code is executed, we get:

The original array is 21 22 23.

First we'll try incrementing all of them:
Incremented value is 22
Incremented value is 23
Incremented value is 24

Now we'll try decrementing all of them:
Decremented value is 20
Decremented value is 21
Decremented value is 22

Now there isn't any really significant savings here; we've only eliminated the need for coding a single foreach loop. But we could use our applyToAll subroutine anywhere we wanted to iterate over the elements of an array, saving that coding effort each time. Moreover, it is at least arguable that the use of applyToAll is more transparent to the reader than is a foreach loop. Here's another example where the savings are a bit more obvious. Here we create a subroutine, collectAllApplies, that is similar to applyToAll except that it also returns an array whose elements are the results of the applications of the given subroutine to each of the elements of the given array:

sub collectAllApplies {
    my ($arrayRef, $subRef) = @_;
    my @result = ();
    foreach (@$arrayRef) {
        push @result, $subRef->( $_);
    }
    return @result;
}

Here is code to exercise this new subroutine:

my @result = collectAllApplies( [31, 32, 33], \&increment);
print "Result of all the increments is: @result\n";
@result = collectAllApplies( [31, 32, 33], \&decrement);
print "Result of all the decrements is: @result\n";

When this code is executed it outputs:

Incremented value is 32
Incremented value is 33
Incremented value is 34
Result of all the increments is: 32 33 34
Decremented value is 30
Decremented value is 31
Decremented value is 32
Result of all the decrements is: 30 31 32

The savings are more pronounced here: each invocation of collectAllApplies saves us from writing several lines of source code.
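It is worth knowing that Perl's built-in map operator performs the same "apply and collect" loop as collectAllApplies. The sketch below assumes a version of increment without the print statement, and the names @input and @viaMap are ours.

```perl
use strict;
use warnings;

# Same idea as increment in the text, minus the print statement.
sub increment {
    my ($arg) = @_;
    return $arg + 1;
}

my @input  = (31, 32, 33);
my @viaMap = map { increment($_) } @input;   # built-in counterpart of collectAllApplies
print "Result via map: @viaMap\n";           # Result via map: 32 33 34
```

Writing collectAllApplies by hand is still a good exercise, because it shows exactly what a subroutine reference buys us.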

Anonymous Subroutine References

We've already seen how to make references to anonymous arrays, scalars and hashes. Likewise we can create a reference to an anonymous subroutine. Here's an example:

$mysub = sub {
    my ($arg) = @_;
    print "Incremented value is ", $arg+1, "\n";
};
$mysub->(3);
&$mysub(6);

This code generates the output:

Incremented value is 4
Incremented value is 7

Anonymous subroutines like this are more useful than you might think. Many languages provide for them. Lisp has a similar construct called "lambda expressions" (a term borrowed from mathematics, expressing the reification of a behavior or function). Java provides for anonymous inner classes, which are often used in conjunction with the callback methods in Java's event-driven GUIs. C++ provides for anonymous functions in conjunction with the collection templates in the STL. The basic idea of all these constructs is similar: there's nothing really gained by attaching a name to a subroutine if it is only going to be used in one place. As an example, let's say we wanted to use our collectAllApplies function to print the array that would result from squaring all the elements of a given array. Now, we could define a subroutine, square, that returns the square of its argument. But if this is the only place where we will be invoking that subroutine, then we might as well leave it anonymous to avoid cluttering the namespace with an extra subroutine name:

my @resultArr = collectAllApplies( \@arr,
          sub { my ($arg) = @_;
                return $arg * $arg; });
print "Result of squaring all the elements in ( @arr ) is ( @resultArr)\n";

Which outputs:

Result of squaring all the elements in ( 21 22 23 ) is ( 441 484 529)

Symbolic References

Symbolic references are really just the name of a variable, rather than a pointer to the address at which that variable's values are stored. The curly-bracket operator is used to "dereference" those names:

@bob = ( 'bob', 'sue');
@sue = ( 1, 2, 3);
$arrayname = 'bob';
print "Symbolically refd array: @{$arrayname}\n";
print "element of that array is name of another array: ${$arrayname}[1]\n";
print "That other array: @{${$arrayname}[1]}\n";

which when executed yields the output:

Symbolically refd array: bob sue
element of that array is name of another array: sue
That other array: 1 2 3
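One caveat: symbolic references only reach package (global) variables, and they are a fatal error under use strict 'refs', which use strict turns on. The sketch below shows the error, how to permit symbolic references in a small block, and a hash of ordinary references that does the same job. The names %arrays, @viaName, @allowed and @viaHash are ours.

```perl
use strict;
use warnings;

# Package variables, as in the example above.
our @bob = ('bob', 'sue');
our @sue = (1, 2, 3);
my $arrayname = 'bob';

# Under "use strict", using a string as an array reference is fatal.
my @viaName = eval { @{$arrayname} };
my $error   = $@;

# Symbolic references can be permitted for one small block.
my @allowed;
{
    no strict 'refs';
    @allowed = @{$arrayname};
}

# A hash of array references does the same job without symbolic references.
my %arrays  = ( bob => \@bob, sue => \@sue );
my @viaHash = @{ $arrays{$arrayname} };

print "strict error: $error";
print "no strict 'refs': @allowed\n";   # no strict 'refs': bob sue
print "via hash: @viaHash\n";           # via hash: bob sue
```

The hash version is usually preferred because a misspelled name yields a missing key rather than silently touching some other global variable.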

Matrices

We've already investigated one dimensional arrays. Perl doesn't have a built-in data type for multidimensional arrays (called "matrices" in many other languages). Perl provides for this construct via arrays of arrays, so we can have:

@matrix = ( [1, 2, 3], [4, 5, 6] );

which defines @matrix to be an array of two elements, each of which is a reference to a three element array. Or, in other words, a 2x3 matrix. Thus $matrix[1][2] refers to the element in the array in row 1, column 2, which is 6. The observant reader might wonder at this syntax. The declaration of @matrix makes it clear that each element of that matrix is actually a reference to an anonymous array. So shouldn't we access element [1][2] via the code $matrix[1]->[2] ? Actually, you can--that code will work perfectly well. Our original code, $matrix[1][2], will also work. Why? Perl doesn't support native multidimensional arrays, so when Perl sees two subscripts side by side, as in [1][2], it knows that the first subscript must yield a reference to an array. The arrow between adjacent subscripts is therefore optional: Perl implicitly applies the dereference (the "$matrix[1]->" part) to reach an array to which it can then apply the "[2]" operation. Here is some code that illustrates this:

# Below, we see that there is more than one way to access elements
# of a 2-d array.
@matrix = ( [1, 2, 3], [4, 5, 6] );
print "survey says! ", $matrix[1]->[2], " is same as ", $matrix[1][2], ".\n";

When the code is evaluated we get:

survey says! 6 is same as 6.
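Because each row is just a reference to an ordinary array, we can ask Perl about the rows directly. A minimal sketch, using the same @matrix; the variable names $kind, $numRows, $numCols and @row1 are ours.

```perl
use strict;
use warnings;

my @matrix = ( [1, 2, 3], [4, 5, 6] );

my $kind    = ref $matrix[1];           # what kind of thing is row 1?
my $numRows = scalar @matrix;           # number of rows
my $numCols = scalar @{ $matrix[1] };   # number of elements in row 1
my @row1    = @{ $matrix[1] };          # a copy of the whole row

print "row 1 is a reference of type $kind\n";                 # ... of type ARRAY
print "$numRows rows, $numCols columns, row 1 = (@row1)\n";   # 2 rows, 3 columns, row 1 = (4 5 6)
```

Nothing forces the rows to have the same length, so $numCols is only meaningful for the row it was measured on.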

Perl arrays/matrices are dynamic; their sizes don't have to be declared. Consequently, assigning to an element of an array/matrix that does not yet exist creates that element, along with any row needed to hold it. Thus, the following code defines and prints a small multiplication table:

# Declare reference to (empty) anonymous array
$array = [  ];

# Initialize the array
for($i=0; $i < 4 ; ++$i) {
  for($j=0; $j < 4 ; ++$j) {
      $array->[$i][$j] = $i * $j; #Note "->" syntax
  }
}

# Print the array
for($i=0; $i < 4 ; ++$i) {
  for($j=0; $j < 4 ; ++$j) {
      printf("%3d ", $array->[$i][$j]);
  }
  print "\n";
}

yields the output (remember the printf statement? We've seen that before.)

  0   0   0   0
  0   1   2   3
  0   2   4   6
  0   3   6   9

Here's a somewhat subtle syntax question: what is the difference between $array->[$i][$j] and $array[$i][$j] ?

Now because matrices are defined dynamically, you might wonder what an assignment to $array->[1000][1000] would do. Would that single statement allocate 1001 rows, each with 1001 elements, i.e., the memory to hold a million entries? Well, not quite! Instead, Perl will see that it needs the outermost array (@$array) to have at least 1001 elements. Currently it has only 4, so Perl will allocate enough memory to hold another 997 entries. It will then see that $array->[1000] should be an array that holds at least 1001 elements, so it will allocate that memory and place a reference to it in $array->[1000]. So, execution of this one statement will allocate 997+1001 = 1998 words of memory. Yikes!! To avoid that we can constitute sparse matrices by using hashes instead of arrays. In this scheme, the "indices" into the "matrix" are actually keys to hashes. Consider this code:

$hashArr = {};
$hashArr->{2}{1000} = 'matt';
$hashArr->{500}{128} = 'cindy';

The first line defines a reference to an anonymous hash, call it Z for clarity. The second statement does two things. First, it defines an anonymous hash, (call it X), that contains a single key, 1000, that maps to 'matt'. Second, it defines that the key '2' will map to a reference to X in Z. The third line is similar. We have to be careful in iterating over sparse arrays because merely looking up a nested element causes Perl to create a "slot" for the outer key in the hash table if one does not already exist. Thus evaluation of the code

if ($hashArr->{50}{333} == 0)

would cause the creation of the {50} slot in the outer hash, holding a reference to a new, empty inner hash, and so allocating memory. (The inner key 333 is not created, and printing the value of $hashArr->{50}{333} would still give "undef", but the memory for the {50} entry has been allocated.) This could be a potential problem when attempting to iterate over the elements of a sparse array. To avoid needlessly allocating slots in the table, you should make judicious use of the exists operator, such as in the code:

for(my $i=0 ; $i < 100 ; ++$i) {
    for(my $j=0 ; $j < 100 ; ++$j) {
        if( exists($hashArr->{$i}) and exists($hashArr->{$i}{$j}) ) {
            print "Array element row $i column $j is $hashArr->{$i}{$j}\n";
        }
    }
}

Do you see why there are two exists above?
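The claims in this section about memory and slot creation can be checked directly. The sketch below is only an illustration under our own assumptions: the starting contents of $array are made up, and the names $outerLen, $rowLen, $before, $after and @entries are ours. The last part walks the keys of the hashes, which visits only the entries actually stored.

```perl
use strict;
use warnings;

# 1. One assignment far outside the current bounds grows the outer
#    array and creates exactly one long row.
my $array = [ [0], [1], [2], [3] ];
$array->[1000][1000] = 1;
my $outerLen = scalar @$array;               # rows in the outer array
my $rowLen   = scalar @{ $array->[1000] };   # elements in the new row

# 2. Merely reading a nested hash element creates the intermediate hash.
my $hashArr = {};
$hashArr->{2}{1000}  = 'matt';
$hashArr->{500}{128} = 'cindy';

my $before = exists $hashArr->{50} ? 1 : 0;
my $unused = $hashArr->{50}{333};            # read only, no assignment
my $after  = exists $hashArr->{50} ? 1 : 0;
delete $hashArr->{50};                       # undo the side effect

# 3. Walking the keys visits only the entries that were stored.
my @entries;
for my $i (sort { $a <=> $b } keys %$hashArr) {
    for my $j (sort { $a <=> $b } keys %{ $hashArr->{$i} }) {
        push @entries, "($i,$j)=$hashArr->{$i}{$j}";
    }
}

print "outer array: $outerLen rows, row 1000: $rowLen elements\n";   # 1001 and 1001
print "key 50 exists before read: $before, after read: $after\n";    # 0 and 1
print "stored entries: @entries\n";   # (2,1000)=matt (500,128)=cindy
```

Iterating over keys is usually a better fit for a sparse matrix than looping over every possible index, since the work done is proportional to the number of stored entries.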

So now we know all about arrays and hashes. Obviously we could mix and match these, creating arrays whose elements are hashes and vice versa. What can you infer about the data structure involved in this statement:

%{$array[3][4]{'foo'}}

When we add anonymous references to arrays and hashes we get potentially even more complex and powerful structures, such as this chunk of code from Tisdall's book:

$gene = [
    # hash of basic information about the gene name, discoverer,
    #  discovery date and laboratory.
    { 
        name       => 'antiaging',
        reference  => [ 'G. Mendel', '1865'],
        laboratory => [ 'Dept. of Genetics', 'Cornell University', 'USA']
    },

    # scalar giving priority
    'high',

    # array of local work history
    ['Jim', 'Rose', 'Eamon', 'Joe']
];

print "Name is ", ${$gene->[0]}{'name'}, "\n";
print "Priority is ", $gene->[1], "\n";
print "Research center is ", ${${$gene->[0]}{'laboratory'}}[1], "\n";
print "These individuals worked on the gene: ", "@{$gene->[2]}", "\n";

Take the time to understand each of the print statements at the end of the code above.
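Once the nested-brace forms make sense, it may help to compare them with the arrow syntax, which reads left to right. The sketch below rebuilds the same structure (without the comments) and fetches the same values; the variable names $name, $center and $count are ours.

```perl
use strict;
use warnings;

# Copy of the $gene structure from the listing above.
my $gene = [
    {
        name       => 'antiaging',
        reference  => [ 'G. Mendel', '1865' ],
        laboratory => [ 'Dept. of Genetics', 'Cornell University', 'USA' ],
    },
    'high',
    [ 'Jim', 'Rose', 'Eamon', 'Joe' ],
];

my $name   = $gene->[0]{name};            # same as ${$gene->[0]}{'name'}
my $center = $gene->[0]{laboratory}[1];   # same as ${${$gene->[0]}{'laboratory'}}[1]
my $count  = scalar @{ $gene->[2] };      # size of the work-history array

print "Name is $name\n";                              # Name is antiaging
print "Research center is $center\n";                 # Research center is Cornell University
print "$count individuals worked on the gene\n";      # 4 individuals worked on the gene
```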

The big drawback with these complex data structures is that there is no way to enforce their structure. I.e., suppose we wanted multiple structures, like the one referenced by $gene above, in an array. We have no way to enforce that all elements in the array conform to that structure. In addition, there are no names given to the various substructures. Instead the programmer has to rely upon internal documentation to describe that genes are "arrays of three elements, the first of which is a hash that contains the keys..." , etc. What is needed, of course, is something like classes in C++ and Java. Perl does provide these, and we'll get to them in the next lecture, so stay tuned!

Because complex data structures can be so common, Perl provides a standard module, Data::Dumper, to print in human-readable format ("pretty print", in the Lisp world) the contents of a structure. The module provides very involved syntax for formatting the output of the structures. You should take a look at perldoc Data::Dumper if you are interested in a higher degree of control over the formatting. The output from Data::Dumper can actually be used (evaluated, or "eval'd"--another term borrowed from Lisp) to generate a clone of the original. Indeed, Data::Dumper provides tools for dumping a printed version of a structure to a file so that the file can be used by Data::Dumper in the future to reconstitute the original structures. This capability is similar to serializable components in Java.
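Here is a minimal sketch of dumping a structure and rebuilding a copy from the dumped text. The structure $record is made up for illustration, and the two configuration variables are just one reasonable choice of settings. Evaluating a string as code is only safe when the text comes from a source you trust.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print hash keys in sorted order
$Data::Dumper::Indent   = 1;   # simpler, fixed indentation

# A small made-up structure: a hash holding a scalar and an array.
my $record = { name => 'antiaging', history => [ 'Jim', 'Rose' ] };

my $dumped = Dumper($record);  # text of the form "$VAR1 = { ... };"
print $dumped;

# Evaluating the dumped text rebuilds an independent copy. The dump
# assigns to $VAR1, so we declare it to keep "use strict" satisfied.
my $VAR1;
my $clone = eval $dumped;
die "could not rebuild structure: $@" if $@;

print "clone name: $clone->{name}\n";   # clone name: antiaging
```

The clone is a separate structure, so changing $clone->{name} afterwards leaves $record untouched.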

For more information on arrays, see perldoc perllol.