Old English Phonotactic Constraints?

Something interesting happens when you train a neural network to predict the next character given a text string. Or at least I think it might be interesting. Whether it’s actually interesting depends on whether the following lines of text obey the phonotactic constraints of Old English:

amancour of whad sorn on thabenval ty are orid ingcowes puth lee sonlilte te ther ars iufud it ead irco side mureh

It’s gibberish, mind you — totally meaningless babble that the network produces before it has learned to produce proper English words. Later in the training process, the network tends to produce words that are not actually English words, but that look like English words — “foppion” or “ondish.” Or phrases like this:

so of memmed the coutled

That looks roughly like modern English, even though it isn’t. But the earlier lines are clearly (to me) not even pseudo-English. Could they be pseudo-Old-English (the absence of thorns and eths notwithstanding)? Unfortunately I don’t know a thing about Old English, so I am uncertain how one might test this vague hunch.

Nonetheless, it seems plausible to me that the network might be picking up on the rudiments of Old English lingering in modern (-but-still-weird) English orthography. And it makes sense — of at least the dream-logic kind — that the oldest phonotactic constraints might be the ones the network learns first. Perhaps they are in some sense more fully embedded in the written language, and so are more predictable than other patterns that are distinctive to modern English.

It might be possible to test this hypothesis by looking at which phonotactic constraints the network learns first. If it happened to learn constraints that are “wrong” for modern English but “right” for Old English (if such constraints even exist), that might provide evidence in favor of this hypothesis.
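One crude way to start on such a test is to list the letter pairs in the network's output that never occur inside words of a reference corpus, then repeat the comparison against an Old English corpus. What follows is only a minimal sketch under loud assumptions: the function names are invented for illustration, the reference strings are toy stand-ins rather than real corpora, and letter bigrams are a rough orthographic proxy for phonotactics, not a proper phonological analysis.

```python
from collections import Counter


def word_bigrams(text):
    """Count letter pairs that occur inside words (never across spaces)."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation does not create spurious pairs.
        letters = [ch for ch in word if ch.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts


def unattested_bigrams(sample, reference):
    """Return the bigrams found in `sample` that never occur in `reference`."""
    seen = word_bigrams(reference)
    return sorted(bg for bg in word_bigrams(sample) if bg not in seen)


# Toy stand-ins (assumptions): a real test would need one sizable modern
# English corpus and one sizable Old English corpus as separate references.
network_output = "amancour of whad sorn on thabenval"
modern_reference = "a man of war and a horn on the oval court had been sworn in"

print(unattested_bigrams(network_output, modern_reference))
```

If a pair turned up as unattested against a large modern corpus but attested against an Old English one, that would be the kind of "wrong but right" pattern described above. A tiny reference like the one here will flag plenty of perfectly ordinary pairs, so the corpus size matters a great deal.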

If you’d like to investigate this and see the kind of output the network produces, I’ve put all the necessary tools online. I’ve only tested this code on OS X; if you have a Mac, you should be able to get this up and running pretty quickly. All the commands below can be copied and pasted directly into Terminal. (Look for it in Applications/Utilities if you’re not sure where to find it.)

  1. Install homebrew — their site has instructions, or you can just trust me:
    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  2. Use homebrew to install hdf5:
    brew tap homebrew/science
    echo 'Now you are a scientist!'
    brew install hdf5
  3. Use pip to install PyTables:
    echo 'To avoid `sudo pip`, get to know'
    echo 'virtualenv & virtualenvwrapper.'
    echo 'But this will do for now.'
    sudo -H pip install tables
  4. Get pann:
    git clone https://github.com/senderle/pann
  5. Generate the training data from text files of your choice (preferably at least 2000k, or about 2 MB, of text):
    cd pann
    ./gen_table.py default_alphabet new_table.npy
    ./gen_features.py -t new_table.npy \
        -a default_alphabet -c 101 \
        -s new_features.h5 \
        SOME_TEXT.txt SOME_MORE_TEXT.txt
  6. Train the neural network:
    ./pann.py -I new_features.h5 \
        -C new_features.h5 -s 4000 1200 300 40 \
        -S new_theta.npy -g 0.1 -b 100 -B 1000 \
        -i 1 -o 500 -v markov
  7. Watch it learn!

It mesmerizes me to watch it slowly deduce new rules. It never quite gets to the level of sophistication that a true recurrent neural network might, but it gets close. If you don’t get interesting results after a good twenty-four or forty-eight hours of training, play with the settings — or contact me!
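If you're wondering what "predict the next character given a text string" looks like as training data, here is a minimal sketch, assuming a fixed-width context window. It is not pann's actual code: the function name, the window size, and the space used for padding are all invented for illustration.

```python
def next_char_pairs(text, window=4, pad=" "):
    """Yield (context, next_char) pairs using a fixed-width sliding window."""
    # Pad the front so that even the first character has a full-width context.
    padded = pad * window + text
    for i in range(len(text)):
        yield padded[i:i + window], padded[i + window]


# Each pair is one training example: the network sees the context
# and is asked to guess the character that follows it.
for context, target in next_char_pairs("ondish", window=3):
    print(repr(context), "->", repr(target))
```

A window like this only ever sees a fixed number of preceding characters, which is one way to picture the gap between this kind of network and a recurrent one that carries state across the whole string.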
