Monday, 6 August 2018

Counting words

Words are important to consciousness studies in at least two ways. First, it seems likely that words are what we use to organise a good deal of our inner life, our consciousness – on which point see the earlier effort noticed at reference 7. Second, words are the way, in everyday life as opposed to in the laboratory, that one person tells another what it is that he has in mind, what it is that he is conscious of. And some people find words fascinating anyway.

So I was moved yesterday to wonder about how many words the brain needs to make space for, and quickly found myself in the well-worked fields of making lists of words and doing experiments with words. One popular experiment is to measure how long it takes to say a word after seeing it on a screen; another is to measure how long it takes to say whether a string of characters presented on a screen is a word or not – a rule of this particular game being that it has to be possible to say the string, it has to be like a word to that extent. Both experiments are cheap proxies for whether the subject really knows the meaning of the word in question.

Lots of open access material here, starting with reference 1, which was turned up for me by Bing. I have also become the proud possessor of an Excel workbook listing 60,000-odd English words. To quote: ‘Although the list was compiled for research purposes, we are fairly confident it contains the vast majority of reasonably known English words (in American English spelling), but we agree that it would be possible to add about the same number of more specialist words’. By way of example, words 50-63 are abecedarian, abecedary, abed, abele, aberrance, aberrancy, aberrant, aberration, aberrational, abet, abetment, abettor, abeyance and abeyant.

I offer various other snippets from this material in what follows.

The starting point

I think here in terms of English. I dare say that other languages using simple alphabets much like ours will behave much the same, but I have not thought about languages without such alphabets.

It seems that dictionaries like the OED are no longer the only way forward, the only place to start. More usually one either plumps for one of the various large databases of written English (for example that at reference 5) or for one of the databases of the subtitles attached to most mainstream television programmes, including all scheduled BBC programmes – more spoken than written English. To this one can add other lists, perhaps drawn from dictionaries, perhaps drawn from spelling checkers.

Very roughly speaking, this yields maybe 500,000 distinct word-like objects, which we call tokens in what follows, of which maybe 50,000 end up counting as words. So how does one make the cut?

Making the cut

Making the cut seems to be a bit of a black art, with lots of dodges and wheezes, some of which are listed below, and there is no one way, no one right way to do things. But it helps to keep in mind what your list is for. Is it supposed to be a list of words in regular use in New England, are you trying to build a list for competitive Scrabble, or are you trying to build a list for an experiment in the Department of Cognitive Psychology at a university somewhere in Middle England? Some of the dodges and wheezes now follow, with a rough sketch of the cheaper ones in code at the end of the list.

Drop tokens which include any non-alphabetic characters.

Drop tokens which include any accented characters. This might be OK for English but might not be OK for languages which are routinely accented.

Drop tokens which include any non-alphanumeric characters.

Reduce any token which includes non-alphabetic characters by dropping those characters.

Replace all upper-case letters by lower-case letters.

Drop tokens which do not occur more than once (or whatever other number you might care to pick) in the corpus in question. A simple device to knock out obscure words and obscure misspellings. Bearing in mind that there are lots of rare words, occurring with a frequency between 1 in a million and 1 in ten million words – on which point see the section ‘Word Frequency Measures’ in reference 2.

Drop tokens which are misspellings. Although one then needs some other list to tell one which tokens are misspellings.

Drop tokens which cannot be said. I dare say one can get a computer to have a fair stab at this these days.

Lots of words are grouped, for example, run, ran, runs and running. Does one need to retain all four tokens? What about big, bigger and biggest? Establishment, establishmentarian, establishmentarianism, disestablishmentarianism and antidisestablishmentarianism? With the first edition of the OED liking to spin out lots of this kind of stuff. And bearing in mind that lots of words have several, quite different meanings, and a word which might be thought redundant in one context might not be redundant in another.

Drop proper names. How does one know? And while this might be right for the OED, the second half of a Larousse does do proper names, as does a brain. Furthermore, the brain will have slots for things which do not have proper names in the ordinary sense, or for which it does not know the proper name, but which are, nevertheless, distinct individuals. The tree at the bottom of my garden, not just any old tree. A problem a computer deals with by allocating long reference numbers, reference numbers which the user might only get to see when things go wrong. Maybe the brain gets by with place.
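
By way of illustration, this is more or less what a database does behind the scenes. A minimal sketch in Python, with the variable name invented for the occasion:

    import uuid

    # An individual with no proper name still gets a slot: an opaque,
    # computer-allocated reference number. The user usually only gets to
    # see it when things go wrong.
    the_tree_at_the_bottom_of_my_garden = uuid.uuid4()
    print(the_tree_at_the_bottom_of_my_garden)
    # something like: 9f8c7b6a-5d4e-4f3a-9b2c-1a0d9e8f7c6b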

What about compounds, words like bedroom and lay-by? Drop compounds altogether? Drop compounds using a hyphen or space as separator? Drop compounds where the meaning derives in an obvious way from the (usually two) components? Under which rule a word like ‘honeymoon’ would survive – but what then about compounds like blow-dry and bedroom? See references 4 and 6.

Drop foreign imports, unless they are clearly well-established in their new home. How does one know?

Drop obscure technical terms, only of any interest to those with a professional interest in the subject in question. Consider for example the proper name of the Bactrian camel, Camelus bactrianus. Maybe drop Bactrian because it is a straightforward derivative of Bactria. Drop Bactria because it is the (proper) name of a place which does not exist any more. Drop camelus and bactrianus because they are foreign. Which only leaves us with camel, which seems a bit thin. Another example would be the important legal phrase ‘mens rea’. Real Latin, as opposed to cod Latin.

Cost is going to be an issue here. Stuff you can do with a computer is one thing, but stuff requiring human intervention, given the numbers involved, is going to be expensive. And if lots of humans are involved there are likely to be quality issues.
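
For what it is worth, the cheaper, mechanical dodges above can be strung together in a few lines of Python. A sketch only – the toy corpus and the frequency threshold are made up, and a real exercise would run over millions of tokens:

    import re
    from collections import Counter

    def candidate_words(tokens, min_count=2):
        # Replace upper-case letters by lower-case letters, then count.
        counts = Counter(t.lower() for t in tokens)
        # Keep tokens which are purely alphabetic (no digits, accents or
        # other non-alphabetic characters) and which clear the threshold.
        return {t for t, n in counts.items()
                if n >= min_count and re.fullmatch(r"[a-z]+", t)}

    tokens = "The cat sat on the mat , then the cat ran off".split()
    print(sorted(candidate_words(tokens)))  # ['cat', 'the']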

Doing the count

The basic idea is neat and simple. You have your list of words. You take a random sample of manageable size from the list. You then sit the subject in front of a screen and present each word in turn, with the subject hitting a key to tell you which words they know. You then have an estimate of the proportion of the words on your list that your subject is likely to know and you can easily scale up to get the total number of words on your list that your subject is likely to know. The total number of words which the brain needs to find room for, somehow or another.
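
As a worked example, with made-up numbers rather than anything from the references:

    list_size = 50_000   # words on the master list
    sample_size = 300    # words actually shown to the subject
    words_known = 210    # words the subject claimed to know

    proportion_known = words_known / sample_size          # 0.7
    estimated_vocabulary = proportion_known * list_size
    print(f"{estimated_vocabulary:,.0f} words")           # 35,000 words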

First enhancement: add in some non-words to control for false positives.

Second enhancement: add a suitable number of random probes to check whether the subject really does know the meaning of the word he claims to know.

Third enhancement: do something statistical with the reaction times, the time from display to hitting a key. This sort of thing being the point of it all for many of the customers for lists of words.

And that’s just three enhancements. A lot of work seems to have been put into the design of these sorts of tests.
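
Taking the first enhancement as an example, one common way to use the non-words is a correction for guessing: discount the yes-rate on the real words by the yes-rate on the non-words. Again a sketch, with made-up rates:

    hits = 0.70          # fraction of real words claimed as known
    false_alarms = 0.10  # fraction of non-words also claimed as known

    # Correction for guessing: p = (hits - false_alarms) / (1 - false_alarms).
    corrected = (hits - false_alarms) / (1 - false_alarms)
    print(f"corrected proportion known: {corrected:.3f}")  # 0.667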

Other points

Regarding the cut from the point of view of the brain, it might be content to compute regular derivatives. It does not need, for example, a pigeon hole for both boat and boats, because it can get from one to the other by computation. But it might want common misspellings, which are harder to compute and which do crop up in real life.
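
A toy version of that computation might just strip a few regular suffixes. Real morphology is a good deal messier – ran, mice, bigger – so this is the flavour of the thing, not a serious stemmer:

    SUFFIXES = ("ing", "ed", "es", "s")

    def stem(token):
        # Strip the first matching suffix, provided a respectable stem remains.
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[:-len(suffix)]
        return token

    for word in ("boat", "boats", "runs", "running"):
        print(word, "->", stem(word))
    # boat -> boat, boats -> boat, runs -> run, running -> runn
    # 'runn' being a reminder of why something cleverer is needed in practice.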

Leaving aside the big differences in digital availability, there are relevant differences between languages. With, for example, English and German doing compound words in rather different ways. Reference 4 gives something of the flavour here, including the observation, attributed to Mark Twain, that some German words are so long that they have a perspective. One size does not fit all.

Conclusions

The concept of word is clearly a tricky one, much more tricky than it might first appear. As so often with concepts, it all depends on the work to which you are going to put it: what sort of heavy lifting do you propose for it?

However, all that said, for the purposes of brain capacity I shall, for the moment, go with the estimate that the brain needs slots for up to 50,000 words. A modest million neurons, if we allow 20 neurons to mark a slot, a pigeon hole – with the contents of each pigeon hole accounting for a good many more, perhaps several orders of magnitude more. That starts to be a more serious dent in the total cerebral cortex supply, of the order of 20 billion neurons or so – that is to say, excluding the 70 billion or so in the much smaller cerebellum, round the back.
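
Making the arithmetic explicit, taking ‘several orders of magnitude’ as three for the sake of the sum – all the figures being the rough estimates above, not measurements:

    words = 50_000
    neurons_per_slot = 20
    slot_neurons = words * neurons_per_slot    # 1,000,000 - the modest million
    content_neurons = slot_neurons * 1_000     # three orders of magnitude more
    cortex_neurons = 20_000_000_000

    print(f"{content_neurons / cortex_neurons:.0%} of the cortex")  # 5% of the cortex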

PS: along the way I bumped into the world of serious Scrabble and learned that the last word in the authoritative list of Scrabble-permitted words is ‘zyzzyva’, a sort of weevil found in the hot parts of Central and South America. But I did not learn how this word was relevant, given that our own Scrabble set includes just one ‘Z’ and using the two blanks for such a word seems both unlikely and excessive.

References

Reference 1: How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age – Marc Brysbaert, Michaël Stevens, Paweł Mandera and Emmanuel Keuleers – 2016.

Reference 2: SUBTLEX-UK: A new and improved word frequency database for British English – van Heuven, W.J.B., Mandera, P., Keuleers, E., & Brysbaert, M. – 2014.

Reference 3: The English Lexicon Project – Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Treiman, R. – 2007.

Reference 4: https://www.thoughtco.com/german-compound-words-1444618. The source of the illustration.

Reference 5: https://books.google.co.uk/. A large text database from Google. Giving it the words used earlier in the week in the Forsyte connection, ‘and you shall say the slaves are ours’ resulted in the right answer at hits eight, nine and ten. While the words ‘and I fell in love with Edward Johnston and physically’ from Eric Gill’s autobiography (to be reported on shortly), resulted in three different editions of same being the first three hits. So it does seem to work, at least on that sort of text. But there was a hint that it is quite sensitive to the exact words used. Is there a butterfly wing flapping effect here?

Reference 6: English Compounds and Their Spelling – Christina Sanchez-Stockhammer – 2018. She says that a good rule of thumb is that a compound which does not function as a noun gets a hyphen (blow-dry). If more than two syllables are involved altogether, then a space (washing machine). If the second part has two letters, then a hyphen (lay-by). Otherwise run the two words together. Not foolproof, but it works a lot of the time. A rough coding of this rule follows after these references.

Reference 7: http://psmv3.blogspot.com/2017/01/progress-report-on-descriptive.html.
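
By way of a footnote, the spelling rule quoted at reference 6 codes up quite neatly. The syllable counter below is a crude stand-in, counting vowel groups, so treat this as a sketch of the rule rather than the author’s own procedure:

    import re

    def syllables(word):
        # Crude: count groups of vowels. Good enough for this illustration.
        return max(1, len(re.findall(r"[aeiouy]+", word)))

    def spell_compound(first, second, is_noun=True):
        # Apply the rules in the order quoted at reference 6.
        if not is_noun:
            return f"{first}-{second}"        # blow-dry
        if syllables(first) + syllables(second) > 2:
            return f"{first} {second}"        # washing machine
        if len(second) == 2:
            return f"{first}-{second}"        # lay-by
        return first + second                 # bedroom

    print(spell_compound("blow", "dry", is_noun=False))  # blow-dry
    print(spell_compound("washing", "machine"))          # washing machine
    print(spell_compound("lay", "by"))                   # lay-by
    print(spell_compound("bed", "room"))                 # bedroom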
