Little Languages and Tables

Recently, a coworker whipped up a Perl script that’ll build all of the Perl modules we support. This is useful when we add a new supported OS or OS version. The script takes a config file, moduledefs, which lists the modules to build, as well as various quirks that affect how and whether the modules should be built. moduledefs is itself a require’d Perl script:

# hash of module names (as known to perl) and parameters. 
# value is an array of parameters, as follows:
#   index 0:    build directory. If no build directory is given, 
#               we assume it is the same as the module name, 
#               changing :: to -
#   index 1:    don't make test. This field is a regex of AFS
#               sysnames not to test on. If this is not set,
#               we make test everywhere. If it is set,
#               and sysname matches, we don't make test
#   index 2:    regex of AFS sysnames not to build on. If this is 
#               not set, we build everywhere. If it is set,
#               and sysname matches, we don't build
# 
%MODINFO = (
        "ARS" => [ "ARSperl", ".*", "alpha_dux|(amd64|i386)_rel30" ],
        "Authen::Krb4" => [ "Krb4" ],
        "CGI" => [ "" ],
        "Compress::Zlib" => [ "", "alpha_dux40" ],
        "Convert::ASN1" => [ "" ],
        "Convert::BER" => [ "" ],
        "Crypt::CBC" => [ "" ],
        "Crypt::DES" => [ "" ],
        "Crypt::IDEA" => [ "" ],

# array of module names (as known to perl) in the order they must be
# built in. 

@MODULES = (
        "ARS",
        "Authen::Krb4",
        "CGI",
        "Convert::ASN1",
        "Convert::BER",
        "Crypt::CBC",
        "Crypt::DES",
        "Crypt::IDEA",
        "Crypt::SSLeay",
        "DBI",                  # DBI needs to be before the DBD modules
        "DBD::ODBC",
        "DBD::Oracle",
        "DBD::Pg",
        "DB_File",

Don’t roll your eyes too much, because this is actually fairly sensible for our environment. But there’s a lot more punctuation than is necessary. The same effect could be achieved more compactly. The @MODULES list could be built with qw, e.g.:

@MODULES = qw(
    ARS
    Authen::Krb4
    CGI
);

but that wouldn’t allow us to have comments in the list, and comments could be useful. So instead, let’s read the list from a data file.
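To see why qw rules out comments, here is a minimal sketch (the two module names are just a sample). Inside qw, a # is an ordinary character, so the words of the would-be comment end up as list elements:

```perl
use strict;
use warnings;

my @modules;
{
    # Perl normally warns "Possible attempt to put comments in qw() list";
    # the warning is silenced here only to keep the demonstration quiet.
    no warnings 'qw';
    @modules = qw(
        DBI         # before DBD
        DBD::Pg
    );
}

# The list now contains the comment's words as well as the module names.
print scalar(@modules), " elements: @modules\n";
```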

__DATA__

Now, all of the code above is already in an auxiliary file separate from the main script (the one whose job is to build and test the modules), so it would be inelegant to further pollute the directory with extra cruft. Fortunately, Perl has the magic token __DATA__, which means “this is the end of the Perl script, and the beginning of the special data section.” The data section can then be read via the special DATA filehandle. So we can write:

while (<DATA>)
{
    next if /^#/;       # Ignore comments
    chomp;
    push @MODULES, $_;     # Add the module name to the list
}
__DATA__
ARS
Authen::Krb4
CGI
Convert::ASN1
Convert::BER
Crypt::CBC
Crypt::DES
Crypt::IDEA
Crypt::SSLeay
# DBI needs to be before the DBD modules
DBI
DBD::ODBC

Reading Tables

Then there’s the %MODINFO hash, which contains information on how to build, whether to build, and whether to test on a given architecture. One drawback so far is that the information about a given module is in two separate places. Since the values in %MODINFO are arrays, we could just put these values in __DATA__, separated by commas or some other separator:

our %MODINFO = ();
our @MODULES = ();

while (<DATA>)
{
    next if /^#/;       # Ignore comments
    chomp;

    my ($modulename, @fields) = split /\s*,\s*/, $_;
        # Split into fields on commas with optional whitespace
    push @MODULES, $modulename;
    $MODINFO{$modulename} = [ @fields ];
}
__DATA__
ARS, ARSperl, .*, alpha_dux|(amd64|i386)_rel30
Authen::Krb4, Krb4
CGI, 
Compress::Zlib, , alpha_dux40
Convert::ASN1
Convert::BER
Crypt::CBC
Crypt::DES
Crypt::IDEA
Crypt::SSLeay, , , alpha_dux40
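For illustration, here is a rough sketch of how the build script might interpret one of these rows, following the rules in the moduledefs comments. The helper names build_dir and skip_on, the sample $sysname, and the choice of an unanchored regex match are all assumptions; the real build script isn't shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: field 0 is the build directory; if it is empty,
# default to the module name with :: changed to -
sub build_dir {
    my ($module, $fields) = @_;
    my $dir = $fields->[0];
    return $dir if defined $dir && length $dir;
    (my $default = $module) =~ s/::/-/g;
    return $default;
}

# Hypothetical helper: true if the regex in the given field is set and
# matches the sysname (unanchored match is an assumption)
sub skip_on {
    my ($fields, $index, $sysname) = @_;
    my $re = $fields->[$index];
    return 0 unless defined $re && length $re;
    return $sysname =~ /$re/ ? 1 : 0;
}

my %MODINFO = (
    "ARS"            => [ "ARSperl", ".*", "alpha_dux|(amd64|i386)_rel30" ],
    "Compress::Zlib" => [ "", "alpha_dux40" ],
    "Convert::ASN1"  => [],
);
my $sysname = "alpha_dux40";    # sample value for this sketch

for my $module (sort keys %MODINFO) {
    my $fields = $MODINFO{$module};
    printf "%-15s dir=%-13s build=%s test=%s\n",
        $module,
        build_dir($module, $fields),
        skip_on($fields, 2, $sysname) ? "no" : "yes",   # field 2: don't build
        skip_on($fields, 1, $sysname) ? "no" : "yes";   # field 1: don't test
}
```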

Little Languages

The next observation is that the data table is sparse: for most modules, the defaults are sensible, so there’s no need to specify more than the name of the module. In the majority of the other cases, there’s only one caveat, e.g., “build Foo in the directory perlFooModule” or “build Bar, but not on Solaris 7 boxen”.

On top of that, we can imagine that in the future, it will be necessary to add other conditions to accommodate oddball modules: “when testing Foo, set $CLASSPATH to /usr/local/oddball-java”, or “Bar’s tests require human intervention, so don’t make test when running in unattended batch mode”, and so forth.

For this, it’s worth defining a little language. A little language is usually a miniature language within a program: something with syntax and semantics, but not enough expressive power to be a full-fledged programming language. Regular expressions, embedded SQL queries, and the first argument to getopt() are examples. Little languages allow a programmer to compactly express some idea, often an application-specific one, that would otherwise take many lines of code.

In this case, let’s define the following syntax: if a line in the data file begins with whitespace, then it is not the name of a Perl module, but a qualifier to the preceding module. The qualifier itself takes the form “<qualifier> <value>”. Thus:

CGI
Compress::Zlib
    nobuild alpha_dux40
DBD::Oracle
    notest .*
    nobuild alpha_dux|sun4x_57

Under this scheme, it makes sense to consolidate @MODULES and %MODINFO into one structure. Let’s have the elements of @MODULES be anonymous arrays; the first element is the name of a module, and the second is an anonymous hash that maps qualifiers to values. If we were writing it out, we could write:

@MODULES = (
    [ "DBD::Oracle",
      { notest => ".*",
        nobuild => "alpha_dux|sun4x_57",
      }
    ],
);
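As a quick sketch of how code might consume this structure, the loop below walks @MODULES and prints each entry back out in the little-language form. The CGI entry is added here only to show a module with no qualifiers.

```perl
use strict;
use warnings;

my @MODULES = (
    [ "CGI", {} ],                      # no qualifiers: all defaults
    [ "DBD::Oracle",
      { notest  => ".*",
        nobuild => "alpha_dux|sun4x_57",
      }
    ],
);

for my $entry (@MODULES) {
    # Each entry is [ module name, { qualifier => value, ... } ]
    my ($name, $qualifiers) = @$entry;
    print "$name\n";
    for my $qualifier (sort keys %$qualifiers) {
        print "    $qualifier $qualifiers->{$qualifier}\n";
    }
}
```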

Multi-Line Records

The first problem we encounter is that <DATA>, since it only reads a line up to the end-of-line character, is no longer guaranteed to read an entire record. The simple loop

while (<DATA>)
{
    # process a record
}

is no longer sufficient. You may be thinking that we need to write something like

while (<DATA>)
{
    $c = first character of the next line;
    if ($c =~ /\s/)
    {
        # The record continues on the next line
        read in the next line;
    } else {
        # We've seen the entire record
        Process the record;
        Put back $c so we can see it in the next iteration of the while loop
    }
}

but there’s a much simpler approach: at this point, we’re not building the modules; we’re just collecting information about them. This means that we can add information to a module that we’ve already seen. So we can just remember the last record we’ve seen:

our @MODULES = ();
our $lastmodule;        # Reference to last module seen

while (<DATA>)
{
    next if /^#/;       # Ignore comments
    chomp;

    if (!/^\s/)
    {
        # This is (the beginning of) a new module
        push @MODULES, [ $_, {} ];
        $lastmodule = $MODULES[-1];
    } else {
        s/^\s+//;       # Trim leading whitespace

        my ($qualifier, $value) = split /\s+/, $_, 2;

        $lastmodule->[1]{$qualifier} = $value;
    }
}
__DATA__
CGI
Compress::Zlib
    nobuild alpha_dux40
DBD::Oracle
    notest .*
    nobuild alpha_dux|sun4x_57

Here, $lastmodule is a reference-to-array. Every time we add an entry to @MODULES (and these entries are references-to-array), we remember the last one we added. If we see a line that begins with whitespace, we can just say “oh, I need to add this information to the last module I saw”. This is a lot simpler than trying to implement lookahead.

Dependencies and Partially-Ordered Sets

The last thing I’ll note is that as currently implemented, @MODULES lists the modules in the order they must be built, but doesn’t say why that order is necessary.

The order comes from the fact that certain modules depend on other modules. They form a partially-ordered set: there are many correct orders in which to build the modules, but they all share the characteristic that DBI is built before DBD::Oracle, that Mail is built before MIME::Tools, and so forth.

Since we talked about adding arbitrary qualifiers, above, it would be nice to add a “requires” qualifier. This would allow us to keep the list of modules in any order we liked, and also to have the machine figure out a right order so we humans don’t have to waste time doing so. It would also make these dependencies explicit.

(Aside: Yes, the Clever Thing would be to read the Makefile.PL for a module and see which dependencies it lists. But in the real world, module authors make mistakes and sometimes forget to list a dependency.)

Under this scheme, instead of keeping @MODULES as a list of modules to build, we can keep %MODULES, an unordered set. Instead of storing a two-element anonymous array with the module name and its qualifiers, we can just have the module name be the key of %MODULES, and the anonymous hash of qualifiers be the value.
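Here is one way the parsing might look under this scheme. It is a sketch with a few assumptions of its own: the data is read from an in-memory string rather than DATA so that the example stands alone, a module may have several “requires” lines (collected into an array), and blank lines and indented comments are skipped as well.

```perl
use strict;
use warnings;

# Sample input; the modules are deliberately listed out of build order.
my $text = <<'END';
# Order no longer matters
DBD::Oracle
    requires DBI
    notest .*
DBI
CGI
END

my %MODULES;    # module name => hash of qualifiers
my $last;       # qualifier hash of the most recently seen module

open my $fh, '<', \$text or die "Cannot read module list: $!";
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*(?:#|$)/;     # Ignore comments and blank lines

    if ($line !~ /^\s/) {
        # A new module: start with an empty set of qualifiers
        $last = $MODULES{$line} = {};
    } else {
        die "Qualifier before any module: $line\n" unless $last;
        $line =~ s/^\s+//;              # Trim leading whitespace
        my ($qualifier, $value) = split /\s+/, $line, 2;
        if ($qualifier eq 'requires') {
            # A module can require more than one other module
            push @{ $last->{requires} }, $value;
        } else {
            $last->{$qualifier} = $value;
        }
    }
}
close $fh;

for my $name (sort keys %MODULES) {
    my @requires = @{ $MODULES{$name}{requires} || [] };
    print "$name requires: ", (@requires ? "@requires" : "(nothing)"), "\n";
}
```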

To implement the partial ordering, we just need to remember which modules have been built so far. We can do this either by keeping a separate %built hash keyed by module name, or by adding a qualifier to the values in %MODULES: just $MODULES{"Foo"}->{"built"} = 1; after building module Foo.

Now the main loop of the program becomes clear: after constructing %MODULES, go through it and look for a module that a) has not been built yet, and b) has no unbuilt dependencies. Build it and mark it as built. Repeat until there are no more modules to build.
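A minimal sketch of that loop follows. It assumes each value in %MODULES may carry a “requires” array of module names, and it only records the order instead of actually building anything. It also builds every ready module on each pass rather than one at a time, and gives up if a pass makes no progress, which would indicate a circular or missing dependency.

```perl
use strict;
use warnings;

# Sample data; the "requires" lists are assumptions for this sketch.
my %MODULES = (
    "DBD::Oracle" => { requires => ["DBI"] },
    "DBD::Pg"     => { requires => ["DBI"] },
    "DBI"         => {},
    "CGI"         => {},
);

my @order;                          # modules in the order they get built
my $remaining = keys %MODULES;

while ($remaining > 0) {
    my $progress = 0;
    for my $name (sort keys %MODULES) {
        my $info = $MODULES{$name};
        next if $info->{built};     # a) not built yet

        # b) no unbuilt (or unknown) dependencies
        my @unbuilt = grep { !($MODULES{$_} && $MODULES{$_}{built}) }
                      @{ $info->{requires} || [] };
        next if @unbuilt;

        push @order, $name;         # the real script would build here
        $info->{built} = 1;
        $remaining--;
        $progress = 1;
    }
    die "Circular or missing dependency; cannot build the rest\n"
        unless $progress;
}

print "$_\n" for @order;
```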
