When I was young, I learned to type. Back then, typing was a rare skill, and acquiring it took a concerted effort.
I remember using this terrible piece of software called Mavis Beacon. Every week, I’d spend an hour tapping out nonsense as a supposed way to learn where my fingers should sit on the keyboard and which keys they should press.
After a few weeks (months?) I became capable of typing somewhat swiftly, and realised that anyone over the age of thirty was incompetent at typing and had neither the inclination nor the time to learn. So I decided to “pivot” my learning into a hobby that got me paid: I’d type people’s handwritten notes onto their computers. In no time at all, not only did I have a shiny new Macintosh Performa, but I was also a reasonable typist.
Even then, voice recognition was claimed to be a solved problem, an alternative to – nay, the future of – typing.
But it wasn’t; in fact, it was terrible. Not only was the actual speech recognition inconsistent, but the wrapping editor software was abysmal. Needless to say, it didn’t take off then.
Fast forward to today: Google has amazing speech recognition. Apple touted voice recognition and Siri as a central pillar of its iOS platform. How’s the usage on these things? The technical side of things – of transcribing your voice to text – is solved.
But the design and implementation of voice isn’t solved; in fact, I think it’s almost unsolvable. Times when you want voice to text:
- Disabled users
There really are very few other use cases. The idea that this would ever replace a keyboard is incorrect.
I don’t think voice is less exhausting to use than typing; in fact, I think it is considerably more so. How often do you type all day? How often do you speak? Now imagine if those were flipped.
I don’t think voice is more efficient than typing. Sure, you can speak faster than you can type, but being able to iterate effortlessly over what you have typed is one of the most fundamental reasons why typing killed handwriting!
Finally, I think speech has a psychological aspect that hasn’t really been discussed. I think people feel very self-conscious “talking to a computer” – no one would ever sit in a coffee shop or a work space and do this. It’s (allegedly) okay to reply to a text message when surrounded by conversing chums, but if you want to speak you must make a concerted effort to detach yourself.
The effort of speech is much higher than that of typing: you can’t multitask, you have to be aware of your surroundings, and you have to think and then commit. So what is the value of it over typing?
Natural Language Processing
Now that computers can understand – well, transcribe – what you are saying, if the use case is not dictation, what is the use case?
Companies like Apple, Wolfram, and others believe it’s “talking to your computer and your computer understanding.”
I believe that this is really misguided. The design of these interfaces, where you can speak like a human to your computer, is equal parts amazing and utterly confusing.
Siri doesn’t understand what you’re saying; it simply follows a rigid structure to extract what you are trying to say. The problem with this is that non-technical users think that Siri genuinely understands them, which causes them to stray from the rigid structure that Siri requires into actually talking to Siri like a human. The result? Siri doesn’t understand, cannot answer, and slowly the users’ trust in Siri diminishes.
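To make that concrete, here is a toy sketch in Python of what I mean by a rigid structure. To be clear, this is not how Siri is actually implemented; the two templates and the `parse_command` function are made up purely for illustration.

```python
import re

# Hypothetical command templates (illustrative only, not Siri's real grammar).
# Each one is a rigid pattern with a named slot to fill in.
TEMPLATES = [
    ("set_alarm", re.compile(r"^set an alarm for (?P<time>\d{1,2}(:\d{2})? ?(am|pm))$")),
    ("directions", re.compile(r"^(get|give me) directions to (?P<place>.+)$")),
]


def parse_command(utterance):
    """Return (intent, slots) if the utterance fits a template, else None."""
    # Normalise case and drop trailing punctuation before matching.
    text = utterance.strip().lower().rstrip(".?!")
    for intent, pattern in TEMPLATES:
        match = pattern.match(text)
        if match:
            # Keep only the named slots that actually matched.
            slots = {k: v for k, v in match.groupdict().items() if v is not None}
            return intent, slots
    # Anything phrased outside the templates is simply not understood.
    return None


if __name__ == "__main__":
    print(parse_command("Set an alarm for 6AM"))
    print(parse_command("Could you wake me up around six tomorrow?"))
```

Stick to the template and you get an intent and a filled-in slot back. Phrase the same request the way you would to a person and you get nothing, which is exactly the moment a user stops trusting the thing.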
Unexpected results kill a product. (The product, not the user, although…)
The only repeated use case I have heard of with Siri is navigation. Times when you want directions whilst driving: almost always. Times when you want to know if Miami Heat are leading the finals, or how tall Barack Obama is – whilst driving: almost never.
Actually, there is another use case of Siri: showing off how amazing the iPhone is. “I can tell Siri to set an alarm clock at 6AM and it will!” The fact that no one ever does this outside of boasting only reinforces my point.
I think that the future of inputs is simply even better typing inputs. We went from T9 phone keypads, to BlackBerry keyboards, to iPhone keyboards, and now to multitouch keyboards. Some of the keyboards for Android are nothing short of amazing – THAT is the future of inputs.
But what about the future of getting output from a computer? I think that at this point humans are used to interfacing with Google via “a query language” and changing that to a human interface language would be very challenging with very little benefit. The first time I launched Google Now on my iPhone, it knew (from Google searches) that I was a Miami Heat fan. That is incredible – preemptively answering what I want to ask. Zero effort with maximum value with risk of burning a user.