Can an evolutionary process generate English text?

Introduction

A central precept of evolutionary biology is that a combination of random variation and natural selection is the fundamental driving force of evolution. In contrast, skeptics of evolution, including many creationist and intelligent design writers, assert that although natural biological processes may result in minor changes within a single species over time, nothing fundamentally new can arise from “random” evolution.

For example, David Foster, in a book skeptical of evolution, discusses and then disputes an argument he attributes to Thomas Huxley, namely that a few monkeys typing randomly for millions of millions of years would type all the books in the British Museum. Foster asserts that even a single line of 50 characters could not be produced in this way, since there are at least 8.5 x 10^49 alphabetic strings of length 50; thus generating a specific given string “at random” is unlikely even over billions of years.

A computational experiment

To test this claim that evolutionary processes cannot generate readable English text, I wrote a computer program that begins by constructing a set of 1024 segments of text, each 64 characters long. The individual characters are chosen at random according to the natural distribution of individual characters in Charles Dickens’ novel Great Expectations.
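The initialization step can be sketched in Python as follows. This is a minimal illustration, not the original program: the function name `initial_population`, the fixed seed, and the short `SAMPLE_TEXT` (the opening sentence of Great Expectations, standing in for the full novel) are all assumptions made for the example.

```python
import random
from collections import Counter

# Stand-in corpus (assumption): the real experiment used the whole novel.
SAMPLE_TEXT = (
    "my father's family name being pirrip, and my christian name philip, "
    "my infant tongue could make of both names nothing longer or more "
    "explicit than pip."
)

def initial_population(corpus, n_segments=1024, seg_len=64, seed=0):
    """Build random segments whose characters follow the corpus frequencies."""
    rng = random.Random(seed)
    counts = Counter(corpus)              # character -> number of occurrences
    chars = sorted(counts)
    weights = [counts[c] for c in chars]  # sampling weight = observed frequency
    return [
        "".join(rng.choices(chars, weights=weights, k=seg_len))
        for _ in range(n_segments)
    ]

population = initial_population(SAMPLE_TEXT)
print(population[0])
```

Sampling by frequency means spaces and common letters such as "e" and "t" turn up about as often as they do in the novel, but the starting segments are still gibberish.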

The program then finds the longest consecutive match between each 16-character subsegment of these strings and the 16-character segments of the text of Great Expectations. The sum of these match lengths is the score for the given 64-character segment. Note that this scoring function has no specific future target, but only measures how typical the given segment is of text in Great Expectations. In other words, Great Expectations plays the role of the “fitness landscape” in evolution.
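One way to read this scoring rule is sketched below in Python. It is an interpretation rather than the original code: it assumes the 64-character segment is cut into four non-overlapping 16-character windows, and that the "match length" of a window is the length of its longest substring occurring anywhere in the corpus. The names `longest_match` and `score`, and the short `SAMPLE_TEXT` corpus, are hypothetical.

```python
# Stand-in corpus (assumption): the real experiment used the whole novel.
SAMPLE_TEXT = (
    "my father's family name being pirrip, and my christian name philip, "
    "my infant tongue could make of both names nothing longer or more "
    "explicit than pip."
)

def longest_match(window, corpus):
    """Length of the longest substring of `window` that occurs in `corpus`."""
    for length in range(len(window), 0, -1):          # try longest first
        for start in range(len(window) - length + 1):
            if window[start:start + length] in corpus:
                return length
    return 0

def score(segment, corpus, window=16):
    """Sum of match lengths over non-overlapping windows (assumed layout)."""
    return sum(
        longest_match(segment[i:i + window], corpus)
        for i in range(0, len(segment), window)
    )

print(score(SAMPLE_TEXT[:64], SAMPLE_TEXT))  # text copied from the corpus
print(score("#" * 64, SAMPLE_TEXT))          # characters absent from the corpus
```

Nothing in `score` refers to a target sentence; a segment scores well simply by resembling stretches of the corpus, wherever they occur.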

Evolutionary iterations are then initiated: First, the top-scoring segments are permitted to “mate” (i.e., randomly exchange 4-long character strings, beginning at positions 1, 5, 9, etc.) with another segment chosen at random from the top-scoring segments. Then random changes are made to these strings, much in the spirit of mutations observed in real biology. After these “mutations” have been performed, each resulting segment is scored, and the segments are sorted according to their new scores. This cycle repeats until 10,000 iterations have been performed. At the end of these iterations, the highest-scoring segment is taken to be the result of the trial. In all, the computer program ran 24,576 such trials.
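The whole cycle might look roughly like the Python sketch below. Several details are assumptions, since the description above does not fix them: the number of top-scoring segments kept (`n_top`), the per-character mutation rate, keeping the parents unchanged alongside their offspring, and the scoring layout (non-overlapping 16-character windows). The population and iteration counts are scaled far down so the example runs in moments, and `SAMPLE_TEXT` is a one-sentence stand-in for the novel.

```python
import random

# Stand-in corpus (assumption): the real experiment used the whole novel.
SAMPLE_TEXT = (
    "my father's family name being pirrip, and my christian name philip, "
    "my infant tongue could make of both names nothing longer or more "
    "explicit than pip."
)

def longest_match(window, corpus):
    """Length of the longest substring of `window` that occurs in `corpus`."""
    for length in range(len(window), 0, -1):
        for start in range(len(window) - length + 1):
            if window[start:start + length] in corpus:
                return length
    return 0

def score(segment, corpus, window=16):
    """Sum of match lengths over non-overlapping windows (assumed layout)."""
    return sum(longest_match(segment[i:i + window], corpus)
               for i in range(0, len(segment), window))

def mate(a, b, rng, block=4):
    """Take each aligned 4-character block from one parent or the other."""
    return "".join((a if rng.random() < 0.5 else b)[i:i + block]
                   for i in range(0, len(a), block))

def mutate(segment, alphabet, rng, rate=0.02):
    """Replace each character with a random one with probability `rate`."""
    return "".join(rng.choice(alphabet) if rng.random() < rate else c
                   for c in segment)

def evolve(corpus, pop_size=64, seg_len=64, iterations=30, n_top=16, seed=0):
    rng = random.Random(seed)
    alphabet = sorted(set(corpus))
    # Drawing characters uniformly from the corpus follows its frequencies.
    population = ["".join(rng.choices(corpus, k=seg_len))
                  for _ in range(pop_size)]
    history = []
    for _ in range(iterations):
        population.sort(key=lambda s: score(s, corpus), reverse=True)
        top = population[:n_top]
        history.append(score(top[0], corpus))
        children = []
        for i in range(pop_size - n_top):
            parent, partner = top[i % n_top], rng.choice(top)
            children.append(mutate(mate(parent, partner, rng), alphabet, rng))
        population = top + children   # parents survive unchanged (assumption)
    best = max(population, key=lambda s: score(s, corpus))
    return best, history

best, history = evolve(SAMPLE_TEXT)
print(history[0], history[-1])
print(best)
```

Because the top segments are carried over unchanged in this sketch, the best score can never fall from one iteration to the next; a variant that also mutates the parents would lose that guarantee.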

Many segments generated by the program contain syntax errors and nonsensical or misspelled words. Many other segments are syntactically acceptable but don’t make much sense. But other segments are entirely reasonable, and could easily pass as fragments of literary text.

Quiz

To see how convincing the better segments are, I constructed the quiz below, then had it administered to some college students at a large university. They were told only that some of these twenty segments of English text are extracted from the writings of Charles Dickens, and some are computer generated.

1. up at it for an instant. but he was down on the rank wet grass,
2. or do any such job, i was favoured with the employment. in order,
3. at the fire as she took up her work again, and said she would be
4. the monster was even careless as to the word that i had him so.
5. as to go with him to his father’s house on a visit, that i might
6. fitted it to nothing and get the ashes between me to the last.
7. as no relation into another that it is the same room – a little
8. a separation to be made for the desolater, like the man he was.
9. we said that as you put it in your pocket very glad to get it, you
10. that he had treated him to a little bee, he was to call the
11. if he had for a time such an interest here and contented me.
12. great iron coat-tails, as he had done, and then ran to that.
13. he saw me going to ask him anything, he looked at me with his glass
14. on my objecting to this retreat, he took us into another room with
15. been born on there, or that i had the greatest indignature.
16. the chimney as though it could not bear to go out into such a night
17. later to settle to anything i had hesitated as to the sound.
18. the greatest slight and injury that could be done to the many far
19. of it on the hearth close to the fear that she had done rather
20. out of my thoughts for a few moments together since the hiding had

The reader is invited to try to identify which of these are authentic snippets of Dickens’ writings and which are computer-generated segments produced by the scheme described above, without consulting any references. The answers are given in the Appendix below.

Did the program generate anything new?

It is important to note that the computer program constructed many legitimate English words that do not appear anywhere in Great Expectations. Some examples:

administer, agitate, chastened, contentions, despot, discriminate, dispensable, dispenses, distances, foundered, generate, inconvenient, intentionally, liberate,
migration, necessitated, operated, possibilities, remonstration, silenced, situations, termination, threatenings, wandered, weathers
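A check of this kind can be sketched in Python as below. The function name `novel_words` and the toy inputs are hypothetical, and the sketch assumes the caller supplies an English word list (`dictionary`), since deciding what counts as a legitimate word requires some outside reference.

```python
import re

def novel_words(segments, corpus, dictionary):
    """Words found in `segments` that are in `dictionary` but not in `corpus`."""
    corpus_vocab = set(re.findall(r"[a-z]+", corpus.lower()))
    found = set()
    for segment in segments:
        # Words cut off at a segment boundary are not treated specially here.
        for word in re.findall(r"[a-z]+", segment.lower()):
            if word in dictionary and word not in corpus_vocab:
                found.add(word)
    return sorted(found)

# Toy inputs (assumptions), standing in for the program output and the novel.
segments = ["he wandered far and was silenced"]
corpus = "he was far and near"
dictionary = {"he", "wandered", "far", "and", "was", "silenced", "near"}
print(novel_words(segments, corpus, dictionary))
```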

Conclusion

A computer program employing an evolution-like strategy is indeed able to generate English text segments reminiscent of Dickens’ writing. At the least, some of the better resulting text segments are good enough to fool human judges in an informal test: on average, college students correctly distinguished true Dickens from computer-generated segments only about 61% of the time. Full details are in [Bailey2009].

Appendix

These items are authentic Dickens:
1, 2, 3, 5, 9, 13, 14, 16, 18, 20
These are produced by the computer program:
4, 6, 7, 8, 10, 11, 12, 15, 17, 19

This post was entered in the NESCent Blog Contest.

Reference

[Bailey2009] David H. Bailey, “Can An Evolutionary Process Generate English Text?”, Biological Theory, vol. 4, no. 2 (Spring 2009), pp. 125-131.
