The Center for Data Innovation spoke with Simon Edwardsson, co-founder of Aipoly, an AI assistive technology startup based in Melbourne, Australia. Edwardsson discussed what advancements in computer vision mean for people with visual impairments, as well as how this technology can be used to help people learn foreign languages.
Joshua New: Using computer vision to assist people who are blind or visually impaired seems like a pretty obvious idea. Why do you think it hasn’t been done as a smartphone app like this before?
Simon Edwardsson: There have been a number of computer vision applications in the last few years trying to help the visually impaired; however, they have usually been rather small in scope, such as only identifying bank notes, or have required expensive hardware.
In broad terms, there are mainly two different ways to do computer vision on smartphones. Traditionally, the smartphone captures an image, performs some preprocessing, such as resizing, and then uploads the image to a cloud service. All the heavy lifting is done on the server side, and the upload speed usually ends up being the bottleneck for the recognition speed. Since users would have to wait a few seconds for an answer, they had better make sure that each shot is good, but that can be very hard if you are visually impaired.
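For illustration, a cloud-based pipeline of the kind described above might look roughly like the sketch below. The endpoint URL and response format are hypothetical, not Aipoly's actual service.

```python
# Hypothetical sketch of a cloud-based recognition pipeline: capture,
# resize on the phone, upload, and wait for the server's answer.
import io

import requests
from PIL import Image

RECOGNITION_ENDPOINT = "https://example.com/recognize"  # hypothetical URL


def recognize_via_cloud(image_path: str) -> str:
    # Preprocess locally so the upload stays small.
    image = Image.open(image_path).convert("RGB").resize((224, 224))
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")

    # Upload and wait; network latency dominates the total recognition time.
    response = requests.post(
        RECOGNITION_ENDPOINT, files={"image": buffer.getvalue()}
    )
    return response.json()["label"]
```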
However, in the last two-to-three years, smartphones have gotten so powerful that it’s now possible to do all of the computations on the device. This, coupled with advances in image recognition and machine learning, such as much more powerful computers to train algorithms and much bigger datasets of images, has led to where we are now. Today Aipoly Vision can recognize over 4,000 objects at five frames per second. The user can pan their phone across a room and continuously hear what they are looking at.
Besides the improved performance from running the recognition locally, we also get a higher level of privacy since the image never leaves the phone and thus won’t risk getting stored on an external server.
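As a rough sketch of that on-device alternative, the loop below classifies each camera frame locally, so no image ever leaves the phone. Aipoly's own runtime is not public; TensorFlow Lite is used here purely as an example, and the model file name is a placeholder.

```python
# Sketch of an on-device recognition loop: every frame is classified
# locally on the phone's own processor, so nothing is uploaded.
import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="classifier.tflite")  # placeholder model
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]


def classify_frame(frame: np.ndarray) -> int:
    # frame is expected to already match the model's input shape and dtype.
    interpreter.set_tensor(input_index, frame[np.newaxis, ...])
    interpreter.invoke()
    scores = interpreter.get_tensor(output_index)[0]
    return int(np.argmax(scores))  # index of the most likely label
```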
New: Given that there are many different kinds of objects in the world, how do you go about creating software that can recognize even a fraction of the things people encounter in a day?
Edwardsson: This is indeed a big challenge. We had to talk to a lot of our users to get a good understanding of what kind of objects they encounter and need help identifying. It is also important to choose a good level of detail for the objects. High-level descriptions are easier to identify accurately, like “kitchen appliance,” while low-level ones are more useful, like “closed fridge.”
At Aipoly we use neural networks to identify objects, trained by exposing our algorithms to a vast number of pictures, each one with a label. For example, if we try to create a system that can identify different breeds of dogs, then we would need thousands of pictures for each breed. Each picture would have to show the dog from a different angle, position, background, and lighting to help the network generalize when it sees a new picture of a dog.
A basic rule of thumb is that the more images of an object there are, the more accurate the network will be. And while there are a lot of pictures of dogs online, it can be harder to acquire thousands of pictures of less photogenic objects, such as toothbrushes.
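A minimal sketch of the kind of training Edwardsson describes might look like the following, assuming photos are sorted into one folder per label. This is an illustration using PyTorch, not Aipoly's actual training code.

```python
# Minimal sketch of training an image classifier on labeled photos,
# assuming a layout like data/labrador/*.jpg, data/poodle/*.jpg.
import torch
from torch import nn, optim
from torchvision import datasets, models, transforms

# Random crops, flips, and color jitter stand in for the varied angles,
# backgrounds, and lighting mentioned above.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("data", transform=train_transforms)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Start from a network pretrained on a large dataset and replace the
# final layer with one output per label in our own data.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))

optimizer = optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```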
New: While Aipoly can recognize an impressive number of different objects, it is still not perfect. For example, I watched a video of Aipoly in action and it identified elevator doors as a refrigerator. How do you solve issues like this when the app can’t always be 100 percent confident?
Edwardsson: When we present an image to our image recognition system, we get both a prediction and a confidence score; if the confidence score is too low, we filter away the prediction. However, sometimes the system gets confused and gives the wrong prediction with high confidence. One solution is to raise the confidence bar, but then the AI becomes very conservative and barely speaks.
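In code, the confidence filter Edwardsson describes amounts to something like this sketch; the threshold value and function names are illustrative only.

```python
# Sketch of confidence filtering: announce a label only when the model
# is sufficiently sure, otherwise stay silent.
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.6  # raising this makes the app quieter but safer


def predict_label(model, image_tensor, class_names):
    with torch.no_grad():
        probabilities = F.softmax(model(image_tensor.unsqueeze(0)), dim=1)[0]
    confidence, index = probabilities.max(dim=0)
    if confidence.item() < CONFIDENCE_THRESHOLD:
        return None  # too uncertain to speak
    return class_names[index.item()]
```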
What we try to do is get more images of both the correct prediction and the wrong prediction so the algorithm can learn the difference, as well as develop better algorithms. Luckily, machine learning and image recognition are fast-growing fields, and every few months a new breakthrough is published that makes our system better.
New: While the primary use of Aipoly is to assist people with visual impairments, what other uses do you think it could have?
Edwardsson: Shortly after we released our first version of Aipoly Vision, we saw a big spike of users from Japan, before we had localized it to other languages. It turned out that it was being used for language learning: point your phone at something interesting and you hear its name in English, which was a use case we hadn’t thought of.
Another interesting example is using the app in education. Imagine a field trip where students go out in nature and, if they encounter a plant or an animal they are unfamiliar with, they can just point their phone at it to get its name and useful information. This is something we are currently working on, and our AI already beats me when it comes to breeds of dogs!
New: Aipoly has been available for over a year now. How has it changed as you get more and more users?
Edwardsson: As I mentioned before, we run the whole image recognition locally on the phone, and while that is good for privacy, it means we can’t build a huge dataset based on real-world photos, even though that would increase our accuracy. Therefore we have trialed various feedback systems where users could submit a picture with their own label, building a crowdsourced database of relevant pictures. While the idea was promising, it turned out to be too hard for users to take good pictures with correct labels. We had to go through every single image to clean up the data.