Figure 1
Some of Google’s interest stems from their work with driverless cars and the need to be able to sort out the streams of image data from their on-board cameras. Another line of inquiry mines the huge numbers of house number images in Street View to help investigate how getting computers to read numbers and words might best be integrated with getting them to take an intelligent interest in pictures – an interest which can sometimes be helped along if the computer can read the incidental numbers and words. Like the labels on bottles of wine. See reference 4.
Without pretending to understand much of the detail of such neural networks, I have attempted the overview diagram included above.
Starting at the middle, the red boxes are the neural network and its supporting environment. This is supposed here to consist of a large number of weights to be applied to the inputs of the rather smaller, but still large, number of neurons (possibly hundreds of thousands of them), some system parameters (controlling things like the details of the optimisation involved in training and of the competition between candidate labels) and a small number of design constraints (things like the speed with which an image must be labelled, the accuracy required of the labelling and the amount of computer power available).
The neurons are arranged in layers, with one layer feeding into the next. In the beginning there were only a small number of layers, but Google has since learned how to build with lots of them – yielding impressive improvements in capability and performance.
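By way of illustration, here is a toy sketch in Python of the forward pass through such a layered network. The layer sizes, the random weights and the simple non-linearity are my own illustrative assumptions; the real networks are vastly bigger and more elaborate.

import numpy as np

# A toy layered network: each layer multiplies its input by a matrix of
# weights and applies a simple non-linearity, with one layer feeding into
# the next. The sizes here are made up for illustration.
rng = np.random.default_rng(0)
layer_sizes = [64, 32, 16, 4]       # a flattened 64-pixel image in, 4 candidate labels out
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(image, weights):
    """Push one image through the layers, one feeding into the next."""
    activation = image
    for w in weights[:-1]:
        activation = np.maximum(0.0, activation @ w)   # ReLU non-linearity
    return activation @ weights[-1]                    # one raw score per candidate label

image = rng.random(64)              # stand-in for a flattened image
scores = forward(image, weights)    # four scores, one per candidate label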
Moving to the left, the green boxes are the images which we want the computer to label, using the vocabulary included at the far right. Words like ‘red’, ‘stripey’, ‘coypu’ and ‘octopus’, where we allow more than one label per image. For the moment we suppose that the images are about things and that we are interested in what those things are, not in what they might be doing. We suppose also that most of the images have been labelled by humans and that most of those labels are available to the computer for training purposes.
The validation images are used to tune the system parameters and the testing images are used to test the tuned system: does it actually label previously unseen images in the way intended? The separation of validation from training is roughly comparable to the separation of system testing from unit testing in regular systems.
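In Python again, and with the 80/10/10 proportions being my own assumption, the three-way split amounts to something like this:

import random

def split_dataset(labelled_images, seed=0):
    # labelled_images might be a list of (image, labels) pairs, where labels
    # is a set of words like {"red", "stripey"}. Most go to training, a slice
    # is held back for tuning the system parameters (validation) and a final
    # slice is kept for testing the tuned system on unseen images.
    items = list(labelled_images)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_validation = int(0.1 * len(items))
    training = items[:n_train]
    validation = items[n_train:n_train + n_validation]
    testing = items[n_train + n_validation:]
    return training, validation, testing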
Figure 2 – taken from reference 6
The usual drill is that you present the system with images, compare the labels that it outputs with the labels that are desired and then feed the differences back into the system as incremental changes to the weights. Do it again. And again. Gradually, if you have set things up properly, the weights will converge to a place where the system will correctly label all kinds of images, including ones that it has not seen before.
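Something like the following, where I have boiled the drill down to a single layer with squared error – a sketch of the idea, not the real training procedure:

import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_labels = 64, 4
weights = rng.normal(scale=0.1, size=(n_pixels, n_labels))

# A toy training set: random "images" paired with made-up desired label vectors.
training_set = [(rng.random(n_pixels),
                 rng.integers(0, 2, n_labels).astype(float))
                for _ in range(20)]

def train_step(image, desired, weights, learning_rate=0.01):
    predicted = image @ weights         # what the system says
    difference = predicted - desired    # how far from the labels we wanted
    # Feed the difference back as an incremental change to the weights.
    weights -= learning_rate * np.outer(image, difference)
    return weights

# Do it again. And again.
for epoch in range(100):
    for image, desired in training_set:
        weights = train_step(image, desired, weights)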
A new-to-me drill is that instead of feeding the differences back into the weights, you feed them back into the images. So you might have an image of a zebra which you want the trained system to label as a whale.
Figure 3
Figure 4
Putting it another way, instead of tweaking the system to fit the images, you tweak the images to fit the system.
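Carrying on with the toy single-layer example from above, the weights are now frozen and the differences are fed back into the pixels instead, until the system reports the label we had in mind – the zebra relabelled as a whale. Again, my own sketch of the idea rather than anybody's actual code:

import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_labels = 64, 4
trained_weights = rng.normal(scale=0.1, size=(n_pixels, n_labels))  # stand-in for a trained system

image = rng.random(n_pixels)   # the "zebra"
target = np.zeros(n_labels)
target[2] = 1.0                # the label we want it to get – say index 2 is "whale"

for step in range(200):
    predicted = image @ trained_weights
    difference = predicted - target
    # Feed the difference back into the image this time; the weights stay put.
    image -= 0.05 * trained_weights @ difference

# By now the tweaked image should score highest on the target label.
best_label = int(np.argmax(image @ trained_weights))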
A more obviously practical application, testing complex systems, is to be found at reference 5.
Reference 1: http://psmv3.blogspot.co.uk/2017/11/reading-brain.html.
Reference 2: https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.
Reference 3: http://image-net.org/challenges/LSVRC/2016/index.
Reference 4: http://ufldl.stanford.edu/housenumbers/.
Reference 5: DeepXplore: Automated Whitebox Testing of Deep Learning Systems – Kexin Pei and others – 2017. See https://arxiv.org/pdf/1705.06640.pdf. With arXiv being hosted by Cornell University.
Reference 6: ImageNet Large Scale Visual Recognition Challenge – Olga Russakovsky and others – 2014. See https://arxiv.org/pdf/1409.0575v3.pdf.