Contents
- The problem
- Some similarities and differences
- An ornithologist’s digression
- A librarian’s digression
- The big difference
- Some comments and conclusions
- References
The problem
The core problem here appears to be a variation of the image classification problem, as discussed at reference 1.
We have a lot of labelled items of music, also known as songs, tracks or works, where ‘label’ includes classifications, descriptors, names and identifiers. Certainly in the past, a lot of this labelling had to be done by humans.
We have digitised versions of one or more performances of each work. These digitisations might all use the same format, or might use a variety. One such might be the Opus format described at reference 2.
We then suppose that we have a clip taken from one of those performances, perhaps just a few seconds, perhaps captured on a mobile phone. The task is to find the work and, preferably, the performance. Put another way, the task is to classify a signal to performance. So not so very different from being given an image and being asked what sort of animal is being portrayed, to classify a signal to animal. Or to classify a signal to person – where there might be a lot more persons to look through than animals, so to that extent more like the performance task.
Being full of image processing from Google, I assumed that the answer, once again, would lie in neural networks.
Some similarities and differences
Both sight and sound are pervasive; they are everywhere. But sound can be projected into space and on into people in a way that sight cannot. And people can consume sound while doing something else in a way that they cannot usually consume sight. We can listen to music, at least after a fashion, while baking a cake. Or to an episode of the ‘Archers’. And many jobbing builders do not seem to be able to function at all without music in the background. But watching an episode of ‘Coronation Street’ is more disruptive – although not so disruptive as to stop many housewives having televisions in their kitchens.
One form that sight takes is pictures. A picture is a unit of sight in much the same way that a song or a track is a unit of sound. But songs have become pervasive in a way that pictures have not. Particular songs become popular and earn revenue. People learn them and hum them; maybe even sing them out loud. For all these reasons, songs are a big business, reaching right into the mobile telephones which most of us now own. Granted, pictures are a big business too, but a quite different kind of business, and only very few of us are aware of buying pictures, while lots of us are aware of buying songs, buying services associated with songs or at least looking at the advertisements which have paid for the songs.
All this may have something to do with the fact that Google and Bing find far fewer technical papers about classifying clips than they do about classifying pictures. The people making song technology want to keep it to themselves, to retain their commercial advantage. That said, one or other of Google and Bing did turn up reference 6, which turned out to be something of a revelation. Of which more in due course.
An ornithologist’s digression
People do pay for information about songs, if only by looking at the advertisements which come with the information, but they don’t pay for information about pictures. That said, we do see a niche market for information about birds.
There are plenty of people out there who spot birds, birds in their gardens, birds on their bird tables and birds when they are out and about. Some of these people have the collecting bug, are tweeters who like to name the birds that they see and to make lists, possibly on paper, possibly in albums and possibly on spreadsheets. Some of these people are called ornithologists, perhaps even appear on television or in episodes of ‘Midsomer Murders’.
But they have a problem in that all too often they do not know what the bird in question is and such a bird cannot be put on a list, does not count as a tweet. Bird identifiers such as that offered by the (very rich) RSPB are rarely of much help. So it seems likely that a commercially viable proportion of these people would be happy to pay a modest sum if they could get their mobile phone to tell them what the bird was. Even though getting their phone to take the load would take some of the fun out of the business of tweeting.
Such a gadget would need various components. First, something to take over the camera part of the phone so as to get a better picture of the bird than you get by just snapping it – which is apt to result in a very small bird against a very large field. Second, a large database of labelled bird pictures. Third, some whizzy software, quite possibly of the neural network variety, to match the inbound picture with those on the database. Fourth, something to return the answer to the customer and to collect a modest fee.
There is clearly an opportunity here for an entrepreneurially minded student of computer science to team up with an ornithologist.
A librarian’s digression
For various reasons, with various uses and applications in mind, librarians of the old-fashioned sort have put a lot of effort into building catalogues of both pictures and songs. Such catalogues had to be searchable and so, if on filing cards, had to be in some useful order. They also had to tell you something about the things so catalogued; one thinks here of the catalogues raisonnés of the art world, typically offline. And they had to tell you where you might find the thing catalogued: the catalogue entry is not usually enough, the user wants to get at the real thing, the original, a distinction which is now blurring, with digital copies of both sight and sound artefacts embedded in catalogues getting very close to the original. For example, if one goes to the Digital Public Library of America, one can very quickly, via the Hathi Trust Digital Library and the Getty Research Institute, be viewing an impressive facsimile of an ‘authentic copy of the codicils belonging to the last will and testament of Sir Hans Sloane, Bart. deceased, which relate to his collection of books and curiosities’. With said Sir Hans Sloane writing his codicils in the middle of the eighteenth century.
Figure 1: a facsimile, including the print showing through from the back
Computer catalogues clearly admit much more flexible searching than card catalogues, and a music computer catalogue might be organised along the lines suggested in Figure 2 below.
Figure 2: an old-style catalogue
In this context, the score of a work might well be important, be expensive to set up in print and be subject to copyright, so the owner of the copyright is important too. Alternatively, the score may not exist at all and one performance of what is nominally the same work may differ significantly from another.
All that aside, given a query involving one or more classifications, possibly in some Boolean combination, the computer simply has to return the records which meet that query. Or perhaps to count them, or to analyse them in some other way.
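To make that concrete, here is a minimal sketch, assuming nothing more than catalogue records held as Python dictionaries, with field names and entries invented for the purpose:

```python
# A minimal sketch of Boolean querying over an old-style catalogue.
# The records and field names here are invented, purely for illustration.
catalogue = [
    {"title": "Symphony No. 5", "composer": "Beethoven", "form": "symphony", "year": 1808},
    {"title": "Winterreise", "composer": "Schubert", "form": "song cycle", "year": 1828},
    {"title": "Symphony No. 9", "composer": "Schubert", "form": "symphony", "year": 1826},
]

def query(records, **criteria):
    """Return the records matching all of the given field=value criteria (a Boolean AND)."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

# Boolean AND: Schubert symphonies.
print(query(catalogue, composer="Schubert", form="symphony"))

# A Boolean OR can be had by taking the union of two such queries,
# and counting is just len(query(...)).
```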
Such catalogues are fine for telling you the name of a work or a song which you know something about. And telling you where you might get a score or a recording, perhaps offering to sell one or the other, if not both. But they might well not contain digital versions of works catalogued and they will not do query by clip.
The big difference
But newer catalogues do contain digital versions of the works catalogued and will do query by clip.
Which brings us to the big difference. Where at reference 1 we talked of complicated shenanigans with neural networks, it turns out that at least one version of the sound classification problem is built with relatively simple algorithms of the conventional variety.
The idea is to generate a key for every song. You then generate, in much the same way, a key for the clip you want to identify. Match the key from the clip to the keys in your database of songs and job done. So what is this key? – and this is where reference 6 helps.
First, you turn your song into a spectrogram, a two dimensional plot of volume by frequency and time. You identify the bright spots in the spectrogram and drop the rest, giving one what Wang calls the constellation map, by analogy with the sort of photograph included at Figure 3 below.
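By way of illustration, a minimal sketch of this first step in Python, using scipy’s spectrogram routine and a crude local-maximum rule for picking the bright spots; the window length, neighbourhood size and threshold are arbitrary choices of mine, not anything taken from reference 6:

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import maximum_filter

def constellation_map(samples, rate, size=20):
    """Turn audio samples into a list of (time, frequency) peaks - the 'bright stars'.

    A crude sketch: keep a point if it is the maximum of its local
    neighbourhood and comfortably above the average level. The
    neighbourhood size and threshold are arbitrary, illustrative choices.
    """
    freqs, times, power = spectrogram(samples, fs=rate, nperseg=1024)
    local_max = maximum_filter(power, size=size) == power
    threshold = power.mean() * 10          # arbitrary; a real system would tune this
    peaks = np.argwhere(local_max & (power > threshold))
    return [(times[t_idx], freqs[f_idx]) for f_idx, t_idx in peaks]

# One second of a 440 Hz tone plus noise, just to exercise the function.
rate = 8000
t = np.arange(rate) / rate
samples = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(rate)
print(constellation_map(samples, rate)[:5])
```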
Next you take a subset of the bright stars and a subset of their neighbours. For each selected star-neighbour pair we extract a pair of values, the ratio of the frequencies – the interval in music-speak – and the distance in time. Notice that while we have used amplitude, the loudness of the note in question, in our selection of bright stars to tag, we have discarded amplitude in forming these tags. String together these tags – which occur at a rate of several a second – and we have the key for the song or for the clip in question.
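Continuing the sketch, and following the description above rather than the detail of reference 6, each tag might pair the rounded frequency ratio with the rounded time gap between a star and one of its near neighbours; the fan-out limit and the rounding are again my own arbitrary choices:

```python
def tags_from_constellation(stars, fan_out=5, max_gap=2.0):
    """Form (frequency_ratio, time_gap) tags from star-neighbour pairs.

    'stars' is a list of (time, frequency) peaks, as produced above.
    Each star is paired with up to 'fan_out' later neighbours within
    'max_gap' seconds; amplitude has already been discarded. The ratio
    and gap are rounded so that slightly different renderings of the
    same sound still produce the same tag.
    """
    stars = sorted(stars)                          # order by time
    tags = []
    for i, (t1, f1) in enumerate(stars):
        paired = 0
        for t2, f2 in stars[i + 1:]:
            if t2 - t1 > max_gap:
                break
            if f1 > 0 and f2 > 0:                  # skip the zero-frequency bin
                tag = (round(f2 / f1, 2), round(t2 - t1, 2))
                tags.append((round(t1, 2), tag))   # remember when the tag occurs
                paired += 1
                if paired == fan_out:
                    break
    return tags
```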
We then match song A to clip B if they share a sufficient, time matched sequence of such tags. And when clip B is taken from the same recording of song A, we usually do get a match, a match which is not greatly disturbed by the inevitable noise in the clip. Nor is it dependent on use of western musical conventions in matters of tone and time – although for this particular algorithm to work, intervals of tone and intervals of time more generally must be central to music the world over.
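And a sketch of the matching step, again illustrative rather than anything from the production systems: index every song’s tags, then let the clip’s tags vote for a song and a time offset; a clip taken from a song in the database should pile its votes onto one song at one offset. The minimum vote count is another arbitrary choice:

```python
from collections import defaultdict, Counter

def build_index(songs):
    """songs: {song_id: list of (time, tag)}. Returns tag -> [(song_id, time), ...]."""
    index = defaultdict(list)
    for song_id, tagged in songs.items():
        for time, tag in tagged:
            index[tag].append((song_id, time))
    return index

def identify(clip_tags, index, min_votes=5):
    """Vote for (song, time offset) pairs; return the winning song, if convincing."""
    votes = Counter()
    for clip_time, tag in clip_tags:
        for song_id, song_time in index.get(tag, []):
            votes[(song_id, round(song_time - clip_time, 2))] += 1
    if not votes:
        return None
    (song_id, offset), count = votes.most_common(1)[0]
    return song_id if count >= min_votes else None
```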
Figure 3: the stars at night
All in all, a slightly more cunning and rather more successful version of the n-grams of sound that Downie and his colleagues had worked on a few years earlier and reported on at reference 7. N-grams of sound which were rather like the n-grams of words which were, in many ways, the precursors of the neural networks now widely used in the language processing noticed at reference 10.
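By way of illustration only – not Downie’s code – a melody reduced to a sequence of intervals can be chopped into n-grams in just the way a sentence is chopped into word n-grams:

```python
def ngrams(sequence, n=3):
    """All length-n runs of a sequence, be it words or intervals."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

words = "the quick brown fox jumps".split()
intervals = [2, 2, 1, 2, 2]        # semitone steps of an ascending scale fragment
print(ngrams(words))               # word n-grams
print(ngrams(intervals))           # 'n-grams of sound'
```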
But in this case, it seems we can manage without neural networks – with more or less conventional matching which can be done at high speed against databases containing millions of songs.
We can also manage without most of the human effort needed to classify all the songs in the database in the ordinary, librarians’ way, effort which might otherwise have had to be sourced from the Amazon Mechanical Turk of reference 9.
Which is a winner for the likes of Shazam (reference 5) and Soundhound (reference 8). I have not yet quite worked out where Gracenote (reference 4) fits into the story, but these last have a big footprint out on the net, floating to the top of a lot of queries about this sort of thing and, according to the press release which came with their joining the Nielsen family, they are ‘the industry’s leading media and entertainment metadata provider’. Noting in this the ‘meta’ bit. They also appear to have something to do with the provision of in-car entertainment, probably important when driving across the wide open spaces of the US. In any event, a substantial operation, built in part, at least, on music identification.
Figure 4: a new-style catalogue
There is a lot of information out there about the distribution of queries from consumers and consumer devices (like mobile phones) about clips and works and it may well be that the query engine is helped along by its knowledge of all these queries and all the answers that were given – and to whom.
All in all, as already noted above, compared with image classification systems, there seems to be a lot of money in this sort of query. It is worth spending a lot of money on systems that get it right – and then patenting them. Which may have something to do with why neither Google nor Bing has been very good at turning up technical material on identifying music.
Some comments and conclusions
The device described above depends on the same clip not cropping up in more than one work. Which seems slightly awkward as we had thought that lots of works of music quote or copy from other works.
It also depends on time, and is unlikely to work well with live performances, the timings of which are unlikely to be a good enough match to any recording, although reference 6 does report some success in this area.
But why does this simple algorithm work for music, while something much more complicated seems to be needed for pictures?
Maybe the answer here is that we are not doing the animal task. We are not analysing the image, we are not looking for heads, ears, trunks or whatever, we are just matching a derivative of the clip against the same derivative of all the works in the library.
It seems quite likely that one could match a clip from an image against a database of images without invoking neural networks. Least squares might be plenty good enough. What is not so easy is saying that this picture, which does not occur in the database, is an elephant, based on the pictures which are labelled elephant and which are in the database. It is possible, at least in principle, to make a database of all the music that has ever been recorded. But it is not possible to make a database of all the possible images of elephants. The best we can hope for is enough images of elephants to capture their essential essence, essential essence which we hope the magic of our neural network can bite on.
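To make the contrast concrete, a least-squares match of an image clip might be nothing cleverer than sliding the clip over each image in the database and scoring the sum of squared differences – a brute-force sketch, with all questions of scale, rotation and normalisation left out:

```python
import numpy as np

def best_match(clip, images):
    """Slide 'clip' over each greyscale image (2-D numpy arrays) and return
    the index of the image containing the patch with the smallest sum of
    squared differences. Brute force, purely for illustration."""
    best_index, best_score = None, np.inf
    clip_h, clip_w = clip.shape
    for index, image in enumerate(images):
        image_h, image_w = image.shape
        for y in range(image_h - clip_h + 1):
            for x in range(image_w - clip_w + 1):
                score = np.sum((image[y:y + clip_h, x:x + clip_w] - clip) ** 2)
                if score < best_score:
                    best_index, best_score = index, score
    return best_index
```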
So the two problems are not so similar after all. And the difference is not to do with the number of dimensions, say two for the spectrogram of music and three for that of a picture.
References
Reference 1:
http://psmv3.blogspot.co.uk/2017/11/more-google.html.
Reference 2:
https://en.wikipedia.org/wiki/Opus_(audio_format).
Reference 3:
https://en.schott-music.com/.
Reference 4:
http://www.gracenote.com/.
Reference 5:
https://www.shazam.com/gb.
Reference 6: An Industrial-Strength Audio Search Algorithm - Avery Li-Chun Wang – 2003. Available at
http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf.
Reference 7: Evaluation of a Simple and Effective Music Information Retrieval Method - Stephen Downie, Michael Nelson – 2000. Available at
http://people.ischool.illinois.edu/~jdownie/mir_papers/sigir2000paper.pdf.
Reference 8:
https://www.soundhound.com/.
Reference 9:
https://www.mturk.com/mturk/welcome.
Reference 10:
http://psmv3.blogspot.co.uk/2017/11/reading-brain.html.