Google Getting Savvy with Images
One of a search engine’s biggest struggles is understanding images. Historically, the likes of Google had very few means of determining what an image shows or how it relates to the page it sits on, relying mostly on user-defined information such as the file name and alt attribute to make assumptions about the image.
Gradually, over the course of many years, Google have been making inroads into understanding what an image is and how to create a product that will be of most benefit to its users.
A lot of these image-based developments are down to Google research scientists Oriol Vinyals, Alexander Toshev and Samy Bengio, and software engineer Dumitru Erhan, who published a paper on automatically generating sentences that accurately describe images, without human intervention.
The paper, “Show and Tell: A Neural Image Caption Generator”, explains how a model based on a deep recurrent architecture works: a convolutional neural network encodes the image into a fixed-length vector, and a recurrent network then generates a caption from that vector, word by word.
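The shape of that pipeline can be illustrated with a toy sketch. To be clear, everything below is an illustrative assumption rather than the paper’s actual code: the weights are random, the “encoder” is a single random projection standing in for a trained convolutional network, and the vocabulary is six made-up tokens. It only shows the structure — image vector in, words out one at a time.

```python
import numpy as np

np.random.seed(0)

VOCAB = ["<start>", "a", "dog", "on", "grass", "<end>"]
EMBED, HIDDEN = 8, 16

# Stand-in "CNN" encoder: in the real model this is a trained
# convolutional network; here it is just a random projection.
W_enc = np.random.randn(HIDDEN, 4)

def encode_image(pixels):
    return np.tanh(W_enc @ pixels)

# Toy recurrent decoder cell (the paper uses an LSTM).
W_x = np.random.randn(HIDDEN, EMBED)
W_h = np.random.randn(HIDDEN, HIDDEN)
W_out = np.random.randn(len(VOCAB), HIDDEN)
word_vecs = np.random.randn(len(VOCAB), EMBED)

def caption(pixels, max_len=5):
    h = encode_image(pixels)   # the image vector seeds the hidden state
    word = "<start>"
    out = []
    for _ in range(max_len):
        x = word_vecs[VOCAB.index(word)]
        h = np.tanh(W_x @ x + W_h @ h)           # recurrent update
        word = VOCAB[int(np.argmax(W_out @ h))]  # greedy word choice
        if word == "<end>":
            break
        out.append(word)
    return " ".join(out)

print(caption(np.array([0.2, 0.5, 0.1, 0.9])))
```

With random weights the output caption is gibberish; training on millions of captioned images is what makes the real model’s output describe the picture.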
You can read the full paper in the Cornell University Computer Science Library, and MIT Technology Review offers a good simplified explanation of it.
Efforts like these go to show just how well Google, perhaps more than anybody, understand the huge gap between AI and humans where vision is concerned.
But that doesn’t mean they haven’t been developing usable image-based tech already.
From a “whoa, that’s actually really clever, I’m genuinely going to use that feature” perspective, no two developments have better shown Google’s willingness to improve their understanding of images than the Google Translate app and Google Keep’s Optical Character Recognition (OCR) capabilities.
Google Translate in Images
The vast majority of us love visiting foreign countries, experiencing new cultures and trying to comprehend the complexity of a new language.
Imagine never having to wonder what a poster, leaflet, road sign, menu or product packaging says, ever again!
That’s where Google makes things massively easier for those who are struggling, or fear they will struggle, to make their way through a foreign country without breaking down the language barrier.
All you have to do is hold up your mobile device to the text, using the camera option in the Google Translate app, and it’s translated, right in front of your eyes, within seconds.
This handy feature will be particularly interesting to Gen Y and Gen Z, for whom travel is an ever more popular pastime. The most frequented travel regions – East Asia, South America and Europe – host an amazing array of languages, accents and dialects to decipher in order to order food at a restaurant, buy anything at a convenience store, book the next leg of the journey or receive medical attention, along with many other day-to-day tasks.
It could even be used in your home country, in authentic restaurants serving foreign cuisine, where 95% of their menu, including the majority of the ingredients, is in a foreign language.
Below is an example of two ways in which you can use the Google Translate app to translate from English to German.
In the above example, you can simply take a picture of an item with words in it and upload it to the app.
The app will identify any areas where there is text and you can either use the “SELECT ALL” button to translate the full block, or highlight any specific areas with your finger to translate any specific words or phrases.
The app then takes you back to the traditional translation-style view of English text in one box translated to a foreign language in another box.
This example shows the live translation of text. It is far from perfect, and any subtle movement of the camera changes the translation, which can look very messy, but it works hard to translate the full block of text in a way that is contextually and grammatically correct. It can take a few moments, but it gets there in the end.
One of the most impressive elements of this feature is not only how Google can recognise words in different languages – 91 in all, which showcases how huge their language database is – but that the app also usually maintains the font that the original text on the image is printed in, and simply replaces the original text, so there is minimal compromise of the design.
Google Keep’s OCR
Google Keep isn’t one of Google’s more widely used products. Think of it as a cross-device personal assistant combined with a private, online memory box.
Nobody explains what it does better than Keep itself: “Capture what’s on your mind. Add notes, lists, photos and audio to Keep.”
When you’re done with an item, you can archive it – just in case you need it again in the future, like a weekly essentials shopping list, for example.
But one of the most impressive things about Google Keep is the way it helps with the usability of text as images.
Ever been provided with an old version of a PDF that is either read-only or a scanned image, and all you really want to do is Copy & Paste some essential data but there’s just no method of getting the text?
Fear not! Google Keep allows you to upload an image and it will automatically detect the characters in the image and produce what it can read in plain text in a matter of seconds.
Below is an image of text from a Life Cover PDS uploaded to Google Keep.
Once the image is uploaded to Google Keep, there is the option to “Grab image text”. As soon as that option is selected, you’re presented with an unformatted, and not particularly pretty, plain-text version of the image.
The text grab isn’t always 100% accurate, and sometimes there has to be some manual checking to ensure that what’s been deciphered is precise.
The mistakes come when, for example, the image is a screenshot of a web page whose content is laid out in a table. The image reader works strictly from left to right, taking nothing else into account, as it has no HTML structure from the web page to guide it.
Using a Google SERP as an example, you will end up with something looking a little like this:
The tool hasn’t managed to distinguish between blocks of text as individual entities, so it combines both the organic (left) and paid (right) listings into one text block in the transcription. Because the text is so small, it also fails to make out special characters, such as the “/” in the URL, and makes numerous spelling mistakes throughout.
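That failure mode is easy to reproduce in a few lines of plain Python. This is a simplified assumption about how a naive row-by-row scanner behaves, not Keep’s actual code, and the listing text is invented for illustration — but it shows why two logically separate columns end up fused in the transcription:

```python
# Two columns of text as they sit side by side on the page:
# organic listing on the left, paid listing on the right.
left  = ["Organic result title", "example.com/page", "Meta description..."]
right = ["Ad headline", "ads.example.com", "Ad copy..."]

def naive_ocr(rows_left, rows_right, gap="   "):
    """Read strictly left to right, row by row, with no notion of columns."""
    return [l + gap + r for l, r in zip(rows_left, rows_right)]

for line in naive_ocr(left, right):
    print(line)
```

Each output line fuses unrelated left- and right-column text, which is roughly what Keep produces for a two-column screenshot. A layout-aware reader would instead group the rows of each column into a separate block before transcribing.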
Increasing the size of the image improves the accuracy of the transcription, but the result may still contain minor errors (the “/T” after the URL).
There are still a few issues with this tool, but overall it is potentially a huge time saver, particularly if you have an image with clear text and obvious spacing.
The Future of Images and Search
There isn’t much beating around the bush with this topic.
It is about getting computers, machines and AI to see, think about and understand images for themselves.
Google have been developing computerised brains, neural networks, which can roughly simulate a human brain’s thinking and learning, and which are a promising lead for the advancement of machine vision, sight and language.
These brains use a method called deep learning, in which a machine trains itself on large amounts of example data rather than being explicitly programmed. As Steven Levy explains in his in-depth article on Medium, it’s very similar to the way a newborn baby learns to understand its senses for the first time.
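A minimal illustration of that learn-from-examples idea is the single-neuron perceptron — decades old and vastly simpler than Google’s deep networks, but built on the same principle. The program below is never told the rule (logical AND); it is only shown examples, and it nudges its own weights until its answers match:

```python
import random

random.seed(1)

# Training examples: inputs and the correct answer (logical AND).
# The rule itself is never written into the program.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w1, w2, b = random.random(), random.random(), random.random()

def predict(x1, x2):
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for _ in range(50):                    # repeated exposure = "training"
    for (x1, x2), target in data:
        error = target - predict(x1, x2)
        w1 += 0.1 * error * x1         # nudge each weight toward the answer
        w2 += 0.1 * error * x2
        b  += 0.1 * error

print([predict(x1, x2) for (x1, x2), _ in data])  # prints [0, 0, 0, 1]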
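A minimal illustration of that learn-from-examples idea is the single-neuron perceptron — decades old and vastly simpler than Google’s deep networks, but built on the same principle. The program below is never told the rule (logical AND); it is only shown examples, and it nudges its own weights until its answers match:

```python
import random

random.seed(1)

# Training examples: inputs and the correct answer (logical AND).
# The rule itself is never written into the program.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w1, w2, b = random.random(), random.random(), random.random()

def predict(x1, x2):
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

for _ in range(50):                    # repeated exposure = "training"
    for (x1, x2), target in data:
        error = target - predict(x1, x2)
        w1 += 0.1 * error * x1         # nudge each weight toward the answer
        w2 += 0.1 * error * x2
        b  += 0.1 * error

print([predict(x1, x2) for (x1, x2), _ in data])  # prints [0, 0, 0, 1]
```

After training, the neuron answers every example correctly. Deep learning stacks millions of such units into many layers and trains them on images instead of four toy examples, but the “learn by correcting mistakes” loop is the same.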
Given enough data to make connections, machines can begin to comprehend images, learning over time exactly what a certain image depicts and how it relates to other, similar images.
Perhaps this could lead to a new Google Image search option which provides contextual results based on the search query, rather than results based on alt tags and on-page content. This would give users the opportunity to provide extremely narrow and precise image search queries and be provided with the exact result they need.
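One common way to build such a content-based search is to embed images and text queries into a shared vector space and rank images by similarity to the query. The file names and vectors below are invented for illustration — in a real system a trained neural network would produce the vectors — but the ranking step itself is just cosine similarity:

```python
import math

# Hypothetical image embeddings: in a real system a neural network
# maps each image (and the text query) into the same vector space.
index = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.8],
    "cat_indoors.jpg":  [0.1, 0.9, 0.2],
    "dog_in_park.jpg":  [0.7, 0.3, 0.6],
}

def cosine(a, b):
    """Similarity of two vectors, ignoring their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def search(query_vec, k=2):
    """Return the k images whose content vectors best match the query."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]),
                    reverse=True)
    return ranked[:k]

# A hypothetical embedding of the query "dog outdoors":
print(search([0.85, 0.15, 0.75]))  # prints ['dog_on_beach.jpg', 'dog_in_park.jpg']
```

The cat photo ranks last not because of its file name or any alt text, but because its content vector points in a different direction from the query — which is exactly the shift from metadata-based to content-based image search described above.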
So maybe this machine learning is not to the level of sci-fi movies set 1000 years in the future, but it’s still a very exciting time for the overall development of technologies and robots.
If they can comprehend written words then they can read and speak. If they can understand shapes, objects and colour then they can explain and describe. And if they can make all of these assessments in less than a couple of seconds, especially of things that are moving – say, a human face – can they understand emotion and interact accordingly?
The guys at Aldebaran certainly think so, with Pepper!
In terms of machine learning in relation to search and SERPs, this understanding could lead to Google fully grasping the intent and desire behind search queries, basing results on information about the real world and serving up exactly what the searcher wants, every time.