Thursday, 25 January 2018

On similarity

Introduction

Prompted by the book, still only half read, at reference 1, we have been attempting to build a model in Excel VB which does some of the work of a neural network – without having to go to the bother of actually building one. Or of learning how to use one of the many freebies which we imagine can be turned up by Bing or Google.

This model includes a number of nodes, which may be active or inactive, with degrees of activity between. Nodes have names, names which are often ordinary English words, like cat or dog, which are strings in geek-speak, and which may not be unique.

This model also includes the idea of similarity between strings: this string is like that string, expressed as a function taking two strings as arguments and yielding a small, non-negative real number by way of result. We might use the functional notation F(a, b), where ‘a’ and ‘b’ stand for the strings in question, the first string and the second string, aka variables. So we might have F(‘hat’, ‘hatter’) with the variable a having taken the value ‘hat’ and the variable b having taken the value ‘hatter’.
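In Excel VB terms, such a function might be declared along the following lines – the name ‘Similarity’ and the details of the signature are our own choices, made for the purposes of the sketches later in this post rather than being the routine eventually arrived at:

```vb
' Illustrative declaration only: the name Similarity and the signature are our
' own choices, standing in for whatever function F we end up with. The rest of
' this post is about what should go in the body.
Public Function Similarity(ByVal a As String, ByVal b As String) As Double
    Similarity = 0#   ' placeholder: a small, non-negative real number
End Function
```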

The purpose of this function is to enable the activity of one node in our model to jump or transfer to other, similar nodes. The more similar the names of the nodes, the more jumping. This being one, as it happens, of a small number of ways in which such jumping can take place.
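By way of a sketch only – the arrays, the damping constant and the loop are all our own inventions rather than the actual model – the jumping might look something like this:

```vb
' Sketch only: spread some of each node's activity to nodes with similar names.
' The arrays, the constant cDamp and the Similarity function above are all
' illustrative inventions, not the actual model.
Sub SpreadActivity(names() As String, activity() As Double)
    Const cDamp As Double = 0.1       ' how much activity leaks along a similarity link
    Dim i As Long, j As Long
    Dim extra() As Double
    ReDim extra(LBound(activity) To UBound(activity))
    For i = LBound(names) To UBound(names)
        For j = LBound(names) To UBound(names)
            If i <> j Then
                ' more similarity of names, then more jumping
                extra(j) = extra(j) + cDamp * activity(i) * Similarity(names(i), names(j))
            End If
        Next j
    Next i
    For j = LBound(activity) To UBound(activity)
        activity(j) = activity(j) + extra(j)
    Next j
End Sub
```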

The purpose of this post is to say something about how surprisingly complicated the business of similarity can get. And jumping ahead, to conclude with the idea that there is no natural, simple, universal way of doing this. It all depends on the work that you want your similarity function to do.

We start with the rather odd idea that F(a, a) should exist. That the formula for saying how similar, or not, ‘cat’ is to ‘hyena’ should also be able to say how similar ‘cat’ is to itself. Or how similar ‘antidisestablishmentarian’ is to the near miss ‘antidisestablishmentarians’. We then say that F(a, a) should be relatively large, probably bigger than F(a, b) or F(b, a) for any values of a and b that you care to choose. The string ‘cat’ should be more similar to itself than to ‘thecatsatonthemat’ or to ‘the cat sat on the mat’, and certainly more similar to itself than to ‘hyena’.

Then we say that F(a, a) should be larger than F(b, b) if a is longer than b. Having two copies, or two near copies, of the string ‘wetherspoon’ is more significant, is less likely to be the result of chance, than having two copies, or two near copies, of the string ‘mug’. This requirement allows defining similarity as the number of matching characters, but not as the proportion of matching characters.

Then if string a starts with string b, we want a to be similar to b. So ‘cats’ is similar to ‘cat’. If string a ends with string b, we want a to be similar to b. So ‘unwell’ is similar to ‘well’. And lastly, if string a contains string b, we want a to be similar to b. So ‘antidisestablishmentarian’ is similar to ‘establishment’. Perhaps, other things being equal, with descending similarity as we go through the three options.

We wondered about whether the function F should be symmetric, whether F(a, b) should be the same as F(b, a) for any two strings a and b that one cares to choose. And we decided that we do not want this. We want the node ‘apple tree’ to stimulate the node ‘tree’ more than the other way around. This because while ‘apple tree’ is an instance of ‘tree’, a more specialised version of ‘tree’, ‘tree’ is not an instance of ‘apple tree’. Put another way, all apple trees are trees, but by no means do we have it that all trees are apple trees. There are pear trees, cherry trees and all sorts of other trees. Put another way again, we want to have it that if a is a substring of b, then b should be more similar to a than a is to b.
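These various requirements can be put as a handful of illustrative checks – the Debug.Assert form and the test names are ours, with Similarity standing for whatever F we end up with:

```vb
' Illustrative expectations only; Similarity stands for whatever function F we
' end up with, so these are requirements rather than facts about any routine.
Sub CheckSimilarityExpectations()
    ' a string should be more similar to itself than to anything else
    Debug.Assert Similarity("cat", "cat") > Similarity("cat", "hyena")
    ' longer self-matches should count for more than shorter ones
    Debug.Assert Similarity("wetherspoon", "wetherspoon") > Similarity("mug", "mug")
    ' substrings should register as similar
    Debug.Assert Similarity("antidisestablishmentarian", "establishment") > 0
    ' asymmetry: the more specialised name should stimulate the more general one
    Debug.Assert Similarity("apple tree", "tree") > Similarity("tree", "apple tree")
End Sub
```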

And we want the function F to be sensible about near misses. To do something sensible about the difference between upper case letters and lower case letters and to do something sensible about possible duplications. To do something sensible about other misspellings. Perhaps, even, to know something about which letters are near which other letters, from the point of view of phonetics or acoustics.

And what about the triangle inequality which says that the sum of two sides of a triangle is always greater than or equal to the third? An inequality from which all kinds of good things flow. An inequality which can be given meaning in this context by saying that the distance between two nodes is some kind of inverse of their similarity, one goes up when the other goes down.
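One simple way of making that inverse concrete – our own choice of formula, offered for illustration only – would be:

```vb
' One possible inverse relation, our own choice for illustration:
' distance shrinks towards zero as similarity grows.
Function Distance(ByVal a As String, ByVal b As String) As Double
    Distance = 1# / (1# + Similarity(a, b))
End Function
```

Whether any particular F then gives a distance which actually satisfies the triangle inequality is another matter, and not one we attempt to settle here.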

We note in passing that we are working in a language, English, which has an alphabet. In the absence of an alphabet one can still compare sentences by comparing the constituent words, but this is a bit coarse grained compared with character based comparisons. Perhaps in the case that one does not have an alphabet, much greater reliance has to be placed on how words sound – to the exclusion of words which cannot be sounded (‘cat1b’), which might otherwise be useful – and replacing the problem of misspellings with those of poor or varying pronunciation and poor or varying reception. From the point of view of transmission at least, letters are much more robust than sounds.

Our routine

All of which sounds terribly reasonable. So we try to come up with a function F in VB which does this sort of thing. Or, more precisely, a subroutine.

A first shot is the simple rule mentioned above, to say that similarity is the number of characters that two strings have in common. So the score for ‘cat’ and ‘at’ is two.
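A minimal sketch of that first shot, on the assumption that characters in common means characters of the second string which can each be paired off against a distinct, so far unused, character of the first – the function name being ours:

```vb
' A minimal sketch of the first shot: each character of the second string may be
' paired off against at most one, so far unused, character of the first.
Function CommonCharCount(ByVal a As String, ByVal b As String) As Long
    Dim used() As Boolean
    Dim i As Long, j As Long, n As Long
    If Len(a) = 0 Or Len(b) = 0 Then Exit Function
    ReDim used(1 To Len(a))
    For j = 1 To Len(b)
        For i = 1 To Len(a)
            If Not used(i) And Mid$(a, i, 1) = Mid$(b, j, 1) Then
                used(i) = True
                n = n + 1
                Exit For
            End If
        Next i
    Next j
    CommonCharCount = n
End Function
```

On this counting the score for ‘cat’ and ‘at’ is indeed two, and ‘caaaat’ against ‘caat’ comes out at the four suggested below.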

We might then add a rule about duplicates, perhaps dropping duplicates from both strings, perhaps scoring the number of matches. So the score for ‘caaaat’ and ‘caat’ might be four.

Pushing a bit harder, do we have the same score for ‘CaAAaT’? Do we have the same score for the pair ‘cat sat on the mat’ and ‘cat sat’ as for the pair ‘cat sat on the mat’ and ‘catsat’? Is the loss of the space from this last string significant? Then switching to functional notation, is F(‘cat12’, ‘cat13’) greater than F(‘cat12’, ‘cat1b’)? Is F(‘red cat’, ‘cat’) the same as F(‘cat red’, ‘cat’)?

One way to proceed is to try to find a map which tries to map, or match, all the characters of the second string onto characters of the first string. One then scores that map. One then tries to find the map, out of all the possible maps, which gives the best possible score. In this, we might have a rule which says that different characters in the second string must not be mapped onto the same character of the first string. We might have another rule which says that the matching characters have to occur in the same order in the two strings, perhaps with a bonus being given for matching characters which are consecutive in both strings. With a problem here being that there may well be more than one way to do matching of this sort, with one way not giving the same result as another. Perhaps we deal with that one by saying that our similarity function takes the best possible match, the one that gives the most similarity, and leave the work that computation of that function might entail to someone else.
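With the order rule in place and one point per matched character, finding the best possible map amounts to finding something very like the longest common subsequence of the two strings, a standard dynamic programming exercise, sketched here without any of the bonuses and penalties just mentioned:

```vb
' The best order-preserving match, scored one point per matched character, is in
' effect the longest common subsequence of the two strings. Sketch only: none of
' the bonuses, penalties or case handling of the real routine.
Function BestOrderedMatch(ByVal a As String, ByVal b As String) As Long
    Dim i As Long, j As Long
    Dim score() As Long
    ReDim score(0 To Len(a), 0 To Len(b))
    For i = 1 To Len(a)
        For j = 1 To Len(b)
            If Mid$(a, i, 1) = Mid$(b, j, 1) Then
                score(i, j) = score(i - 1, j - 1) + 1
            ElseIf score(i - 1, j) >= score(i, j - 1) Then
                score(i, j) = score(i - 1, j)
            Else
                score(i, j) = score(i, j - 1)
            End If
        Next j
    Next i
    BestOrderedMatch = score(Len(a), Len(b))
End Function
```

So, for example, the score for the pair ‘cat sat on the mat’ and ‘cat sat’ comes out at seven, the space included.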
Following the earlier thought, we then need to check what happens with substrings. It would be good if when a is a substring of b, then F(a, c) is always less than or equal to F(b, c) – because if a is smaller than b, we have less stuff in the first string to be matched with stuff from the second – and F(c, a) is always less than or equal to F(c, b) – because with b we have more stuff in the second string to be matched with stuff from the first.

After some days of this, we wind up with a routine, about one fifth of which is illustrated above. A routine which involves nine constants (for example ‘c081’), all of which need to be given sensible values, and all kinds of choices about how exactly the algorithm is going to work. With our hopefully having made sensible choices, taking into proper account the sort of strings our routine is going to encounter. All of which now needs to be tested on a good sample of the same.

Other people’s routines

Reference 2 points to a variety of other ways of doing this sort of thing, some of which look to be well established and some of which look a good deal more complicated to compute than the algorithm we have come up with.

But some of them looked nice and simple, with one such being based on the scalar product of the two strings. Unfortunately, that one stopped being nice and simple once one started to say what one meant by the scalar product of two character strings, this product being more usually defined on vectors of numbers rather than vectors of characters.
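The usual way of giving the scalar product a meaning – and it is only one of several possibilities, not necessarily the one intended at reference 2 – is to turn each string into a vector of character counts and then take the cosine of the angle between the two vectors:

```vb
' One common reading of the scalar product of two strings, not necessarily the
' one intended at reference 2: turn each string into a vector of character
' counts, then take the cosine of the angle between the two vectors.
Function CosineSimilarity(ByVal a As String, ByVal b As String) As Double
    Dim countA(0 To 255) As Double
    Dim countB(0 To 255) As Double
    Dim i As Long, c As Long
    Dim dot As Double, normA As Double, normB As Double
    For i = 1 To Len(a)
        c = Asc(Mid$(a, i, 1)) And 255
        countA(c) = countA(c) + 1
    Next i
    For i = 1 To Len(b)
        c = Asc(Mid$(b, i, 1)) And 255
        countB(c) = countB(c) + 1
    Next i
    For i = 0 To 255
        dot = dot + countA(i) * countB(i)
        normA = normA + countA(i) * countA(i)
        normB = normB + countB(i) * countB(i)
    Next i
    If normA > 0 And normB > 0 Then CosineSimilarity = dot / Sqr(normA * normB)
End Function
```

Note that this version ignores the order of the characters altogether and is symmetric, so it fails the apple tree test above.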

However, we did like the look of a Russian distance, the Levenshtein distance, described in more detail at reference 3, with the idea being to count how many changes you have to make to get from one string to the other, with the implementation of this distance being a lot shorter and neater than we had expected. A model of concision – if not of comprehensibility, although Wikipedia does include some helpful examples.
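For what it is worth, a bare-bones Excel VB rendering of the algorithm described at reference 3 – without any of the extra twiddles mentioned below – might run along these lines:

```vb
' A bare-bones Levenshtein distance, following the description at reference 3.
' None of the extra twiddles for duplicates or for upper and lower case.
Function Levenshtein(ByVal a As String, ByVal b As String) As Long
    Dim i As Long, j As Long, cost As Long, best As Long
    Dim d() As Long
    ReDim d(0 To Len(a), 0 To Len(b))
    For i = 0 To Len(a)
        d(i, 0) = i                    ' delete everything from a
    Next i
    For j = 0 To Len(b)
        d(0, j) = j                    ' insert everything from b
    Next j
    For i = 1 To Len(a)
        For j = 1 To Len(b)
            If Mid$(a, i, 1) = Mid$(b, j, 1) Then cost = 0 Else cost = 1
            best = d(i - 1, j) + 1                                          ' deletion
            If d(i, j - 1) + 1 < best Then best = d(i, j - 1) + 1           ' insertion
            If d(i - 1, j - 1) + cost < best Then best = d(i - 1, j - 1) + cost ' substitution
            d(i, j) = best
        Next j
    Next i
    Levenshtein = d(Len(a), Len(b))
End Function
```

Being a distance rather than a similarity, smaller here means more alike, so it needs turning upside down before it can be used to drive the jumping of activity.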

A bit of translation work into Excel VB and this new routine is now working, with a number of extra twiddles to deal with things like duplicates and upper and lower case – and with a modest six constants rather than the nine we had before. But not quite as short and neat as at first appeared.

And the first observation is that it seems to give much the same results. Strings which are near each other get good scores, strings which are not get bad scores. Maybe that is good enough. Or do we need to make a more serious comparison?
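A more serious comparison need be no more than running the two scorers – here the sketch versions above stand in for the real routines – over a fixed list of pairs and looking at the results side by side:

```vb
' Run the two sketch scorers over a few of the example pairs from this post and
' print the results side by side in the Immediate window. The real comparison
' would of course use the real routines and a much larger sample.
Sub CompareScorers()
    Dim pairs As Variant
    Dim p As Variant
    pairs = Array(Array("cat", "hyena"), Array("cat", "at"), _
                  Array("apple tree", "tree"), Array("cat12", "cat13"), _
                  Array("cat sat on the mat", "catsat"))
    For Each p In pairs
        Debug.Print p(0), p(1), BestOrderedMatch(CStr(p(0)), CStr(p(1))), _
                    Levenshtein(CStr(p(0)), CStr(p(1)))
    Next p
End Sub
```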

What about the brain?

The brain seems to be quite good at assessing how similar things are, be those things birds, bees or words. And given that words arrived on the scene quite late, it seems unlikely that any of the brain’s machinery is particularly adapted for the comparison of strings.

Contrariwise, it does seem quite likely that there are places in the brain for particular words, places which can be connected together, rather in the way that we want to connect our nodes together. Places which can be linked together by synaptic connections. With the individual links being directed, and with a collection of links possibly being in one direction or the other or, more likely, something in between, rather like our similarity function.

But we do not know whether places for words are linked together according to how they are spelt, to how they are heard, to how they are said or simply according to how they are used. Or, indeed, whether the building and breaking of links uses all of these mechanisms. Remembering in this that hearing and saying are old in evolutionary terms compared with writing.

In any event, while a neural network in a brain might well do something very like the similarity described above, it seems very unlikely that it would start with rules and then code them up. Much more likely that some kind of assembly of neurons implements something or other, then evolves, quite quickly, as the host grows up, evolves towards something which works. Much more likely still that real neural networks are not terribly modular and functions like similarity are all mixed up with lots of other stuff and hard to untangle. Perhaps implemented in different ways in all the different places where some kind of a similarity function is needed.

Turning to genes, it is clear that they specify a general purpose computing facility which somehow has to learn how to do particular things, like similarity, for itself – noting in passing that these same genes have provided the vertebrate retina with the fairly specialised equipment needed to detect useful stuff out in the real world. John Holland, an eminent computer scientist who took an interest in such things, was apt to talk about the Krebs cycle for generating the energy needed to keep organisms alive, a cycle which is strongly conserved, present in more or less all the larger living organisms that have been around for the last billion years or so. But we are not sure that the fact that a reasonably complicated chemical cycle occurs everywhere, is more or less the same everywhere, means that the same will be true of similarity algorithms. While the genes might have resulted, in time, in assemblies of neurons that do similarity, they do not code for similarity in the ordinary, computing sense of the word.

Closing thoughts

The first thought is that it is surprising how quickly all this can become rather complicated. Perhaps it would all be a lot easier if we used an artificial neural network rather than a conventionally coded algorithm, with all its complicated rules and regulations. We just give our network some indications about how our similarity function is to behave, perhaps by giving it some examples, and it works away, by itself, to come up with something, something which works even if we do not understand the detail of how it works. We might have trained the network, but we do not know what rules it has come to along the way. Which can be annoying.

Another line would be to write a similarity function with lots of parameters, then to search the space defined by those parameters for an optimum solution, with optimum perhaps being defined in terms of performance on some set of examples.
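A crude sketch of that line of attack – in which the number of parameters, their ranges and the Quality function which scores a parameter set against the examples are all invented here for illustration – might be a simple random search:

```vb
' Crude random search over the constants of a parameterised similarity function.
' The number of parameters, their ranges and the Quality function are all
' invented for illustration.
Sub SearchParameters()
    Const nParams As Long = 6
    Const nTrials As Long = 1000
    Dim best() As Double, trial() As Double
    Dim bestScore As Double, trialScore As Double
    Dim k As Long, i As Long
    ReDim best(1 To nParams)
    ReDim trial(1 To nParams)
    bestScore = -1E+30
    Randomize
    For k = 1 To nTrials
        For i = 1 To nParams
            trial(i) = Rnd() * 10          ' draw each constant from an arbitrary range
        Next i
        trialScore = Quality(trial)
        If trialScore > bestScore Then
            bestScore = trialScore
            best = trial                   ' keep the best parameter set seen so far
        End If
    Next k
    Debug.Print "best score so far", bestScore
End Sub

Function Quality(p() As Double) As Double
    ' Hypothetical stand-in: a real version would run the parameterised similarity
    ' function over a set of example pairs and score how well it did.
    Quality = -Abs(p(1) - 2#) - Abs(p(2) - 5#)   ' arbitrary placeholder surface
End Function
```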

But a big part of the answer seems to be that it all depends. There is no one way to do similarity and while, other things being equal, it is best to keep things simple, it all depends on what work you want your similarity function to do and what sort of things you want your similarity function to compare. Are we writing a text retrieval system or are we trying to eliminate misspellings? Or what? Or, as they used to teach us in the days of structured methods, don’t start by rushing into code. Start by being clear about the requirement.

The closing thought is that a similarity function only has to be good enough. While it is nice to have short and neat code, that should not be an end in itself. And while it is nice for the code to give the right answer for all kinds of perverse examples, that should not be an end in itself either. It is enough to work on sensible, real world examples, most of the time.

PS: some of us used to add to requirements analysis the proviso that the requirement interacts with what is possible; we are in a feed-back, not a feed-forward situation. Which makes things much more complicated.

References

Reference 1: Before You Know It: The Unconscious Reasons We Do What We Do – John Bargh – 2017.

Reference 2: https://en.wikipedia.org/wiki/String_metric.

Reference 3: https://en.wikipedia.org/wiki/Levenshtein_distance.
