ooblick.com
Warning: May contain blasphemy.

Evolving a Self-Replicating Program

Contents

This was written in response to a message to talk.origins by Zoe Althrop in which she wrote:

if you can show me a program that is acknowledged to be a program, and
this program does not have a programmer, then my hypothesis will be
falsified.

I thought it would be interesting to "breed" a program using evolutionary computation. Specifically, I tried to "breed" a program which, when it was run, would print out a copy of itself. I failed at this, but learned some interesting things in the process.

A Brief Overview of Evolutionary Computing

If you've ever watched a "Making Of" documentary about a science fiction movie, you may have been shown piles and piles of proposed and discarded designs for costumes, space ships, aliens, and so forth. This illustrates the central idea behind evolutionary computing (EC): to come up with a good design, all you have to do is come up with a million designs, and throw out the bad ones.

There are many ways to write a program that uses evolutionary computing, but they all share the same basic outline:

  1. Come up with some proposed solution.
  2. Come up with variants on the best solution so far.
  3. Assign each candidate a "fitness score," which says how well the candidate works.
  4. Keep the candidate with the highest score (some schemes keep several top-scoring candidates) and throw out the rest.
  5. Start over at step number 2.

EC is best suited for problems in which there are too many variables for it to be possible to calculate a definite answer, and where it is possible to arrive at the best answer by successive improvements.

The trickiest part of the whole process is coming up with a way to assign a fitness score to a candidate. The problem is that the algorithm above will generate solutions with the highest score, which may not be what you want. This is like saying that in a democracy, the winner of an election is not the best candidate, but the candidate most able to win an election.

The Program

I chose to breed a Scheme program (Scheme is a dialect of the Lisp programming language), mainly because it has a very clean syntax that's easy to generate, partly because one of the traditional projects for students of Lisp is to write a program that prints itself, and partly because I had a Scheme interpreter (SIOD) lying around.

Constructing a Self-Printing Program in Lisp

In Lisp, the construct (quote foo) simply returns foo without evaluating it. The (list a b c) function takes its arguments and makes a list out of them (Lisp is very big on lists (and lists of lists)).

We can define a function quine that takes one argument, and returns a list with that argument, and a quoted version of the argument:

(define (quine arg)
  (list arg
        (list (quote quote)
              arg)))
          

So that (quine 123) would return (123 (quote 123)). Likewise, (quine (quote quine)) returns (quine (quote quine)).

So we almost have a program that prints itself. The only snag is that to make it work, we had to define the function quine and this program does not print the definition of that function.

But what if it were possible to achieve the same thing without calling quine by name? In Scheme, define simply defines a lambda expression (think of it as a function body) and gives it a name. Since we don't care whether our bit of code has a name, we can simply define the lambda expression

(lambda (arg)
  (list arg
        (list (quote quote)
              arg)))
          
and use that wherever we wrote "quine", above. Thus:
((lambda (arg)
   (list arg
         (list (quote quote)
               arg)))
 123)
          
returns (123 (quote 123)) and so forth. Of course, instead of "123", we can just quote the body of quine and pass that to the function, giving in effect (body-of-quine (quote body-of-quine)), which should return itself. And here it is:
((lambda (arg)
   (list arg
         (list (quote quote)
               arg)))
 (quote (lambda (arg)
          (list arg
                (list (quote quote)
                      arg)))))
          

I chose to write the framework in Perl, for no good reason, except that Perl makes it easy to run other programs (like Lisp interpreters) and read their output.

Thus, I wrote a Perl program that writes Lisp programs, runs them, and evaluates how well they did.

The randomprog function generates a random chunk of Scheme code. Actually, this function is quite crude and simply generates lists of words that appear in Scheme, and lists of such lists. Much of what it generates is syntactically incorrect and will be rejected by the Scheme interpreter, later on.

The mutate function takes a program, picks a random part of it, and changes it to something else. This "something else" is a random piece of Scheme code of arbitrary complexity, so these mutations can introduce sweeping changes.

The score function runs a candidate program using the SIOD Scheme interpreter, compares the input to the output, and assigns a score to the candidate. There is much scope for play here.

The version given here counts the number of times that the nth character of the output matches the nth character of the program, and divides that by the length of the original program. This gives us the percentage of "correct" characters in the output. This is multiplied by 100 and added to the score.

In Lisp, any number, such as 123, is a valid program. Of course, if you evaluate the program 123, you get "123", which matches its input perfectly. However, this is not at all interesting, so the score function includes an ad hoc rule to subtract 200 points from any program whose output includes only "uninteresting" characters (primarily digits).

An earlier version of this program subtracted a number of points that depended on the length of the program (length1.04, to be exact). This was to avoid a problem with programs growing without bound (see below).

Finally, the main body of the program simply applies the evolutionary algorithm outlined above: it generates a random program, creates 100 mutant children per generation, selects the one with the highest score as the ancestor for the next generation, and repeats.

Results

I did not manage to generate a self-printing Lisp program such as the one described in the sidebar. I did, however, notice some interesting effect.

Scores

The most striking result is that this framework generates programs with high scores, and not necessarily programs like the ones I was looking for. I think of this as the "whatever you can get away with" effect.

For instance, earlier versions of the score function simply counted the number of characters in the output that matched the corresponding character in the input. The problem with this was that you are bound to get some matches purely by chance. If a 10,000-character program printed about 10,000 characters of output, and five percent of the output matched by chance, that program would have a score of 200. A 150-character program that printed an exact copy of itself, on the other hand, would only receive a score of 150.

In addition, many types of Scheme programs return something that looks very much like the input. For instance, in SIOD, the program

(lambda (arg) foo)
    

which defines a function body, returns

#<CLOSURE (arg) foo>
    

If, by some chance, the program used functions whose names had just the right length, the output would line up exactly with the input, and all of the output would match, except for bits at the beginning and end. This rewards programs for growing without bound: a 1,000-character program might get a score of 9994, even if it did nothing at all, and any mutation that increased the length would increase the score correspondingly.

To mitigate this effect, I added a clause to the score function that subtracted the length of the program, raised to the power 1.04. This is an exponential function designed to penalize long programs much more harshly than short ones. The exponent was chosen to be shallow enough to allow some growth, but steep enough to prevent runaway expansion.

In the other direction, as mentioned above, certain trivial programs got unexpectedly high scores. Numbers, strings, and built-in constants such as () and t all evaluate to themselves. These are, technically, self-printing programs, but spectacularly uninteresting ones. Thus, I added a clause to the score function to subtract 100 points from any program under 50 characters in length (a penalty for being too short). Later, I changed this to subtract 200 points from any program that consisted only of "uninteresting" characters (digits and parentheses, mostly).

Finally, it should be noticed that since the randomprog function makes only a token attempt at producing valid Scheme code, much of what it generates will have some syntactic error. Any child that cannot be run automatically gets a score of -10,000.

Mutations

After the scoring function, the next most important factor was the type of mutation produced by the mutate function. We can imagine many different types of mutation. At one extreme, we could simply throw out the parent and generate a completely random child from scratch; this could be useful for exploring new territory. At the other extreme, we could mutate a parent by replacing a number with a different number, or a variable with a different variable, but not changing the overall structure of the parent program; this could be useful for fine-tuning a solution that already works fairly well.

I used an intermediate approach: the mutate function picks a random part of the program and either replaces it with a random piece of code, deletes it, or doubles it. Thus:
Mutation type Input Result
Replacement (a b c) (a (cons (car *pi*)) c)
Deletion (a b c) (a c)
Doubling (a b c) (a b b c)

In this implementation, each child had exactly one mutation, though it would be simple enough to mutate each child twice, thrice, or some random number.

Originally, I only had the replacement and deletion mutation types. I added doubling partly because it occurs in nature, and partly in the hope that it would lead to the sort of program described in the sidebar. The doubling mutation turned out to be particularly interesting since it started with a piece of code that had already survived the previous generation without causing the interpreter to abort.

This turned out to be very efficient at generating the sorts of infinitely-expanding programs described above, particularly since doubling an existing chunk of code gave better than average odds of having matching characters in the input and output, since the parent candidate presumably already had a higher score than its competitors.

The types of mutation matter because they represent different ways of exploring the neighborhood of a proposed solution. Small changes are good at refining a proposed solution to make it work better. Large changes are good at exploring the range of possible solutions to avoid getting trapped at a local maximum. In many EC frameworks, the system starts with large-step mutations to attempt to find the general form of the best overall solution, and then switches to smaller-step mutations to refine it. My framework does not do this.

It should also be noted that this version of the framework only uses mutation, and not combination (sex). It has been found experimentally that by combining aspects of two or more candidate programs to produce a new one, the child can include the best of both parents and excel in areas in which neither parent does.

However, the programs generated by randomprog and mutate vary so widely that it is not entirely clear to me how one would combine two programs. This would almost be like trying to cross-breed sparrows with wasps.

Conclusions

This program failed to generate a non-trivial Scheme program that evaluates to itself. There may be several reasons for this:

It may be that the scoring function did not define an environment where there is a smooth path to the desired result, where each successive change gave a higher score.

It may be that the problem is not amenable to being solve by evolutionary methods. There may be too many local maxima, which give fair to middling scores, but no good way to step from a local maximum to the global maximum without ever getting a lower score. A different scoring function might help here.

It may be that the randomprog and mutate functions produced too many offspring with syntax errors or other obvious problems, so that it would have taken longer to generate an optimum solution that I had patience. A system that made a better attempt at generating valid Scheme code might perform better in this respect.

Nonetheless, this has proven to be an interesting experiment.