Programmers and engineers have made great leaps in the science of voice recognition over the past decade, so you’d be forgiven for thinking that this technology is a relatively new development. Much of the reporting and scholarship around voice recognition tech focuses only on the post-2011 Age of Siri, following the release of Apple’s now-ubiquitous personal assistant.
But there’s a rich secret history to voice recognition tech that stretches back to the mid-20th century, to those early days when rudimentary computers needed to fill an entire warehouse with vacuum tubes and diodes just to crunch a simple equation. And this history not only reveals some interesting trivia about the technology we know and love today, it also points the way towards potential future breakthroughs in the field.
Let’s explore the untold story of voice recognition technology, and see how much progress has been made over the years (and how much has stayed the same).
AUDREY and the Shoebox
In the first half of the 20th century, the U.S. research firm Bell Laboratories (named after Alexander Graham Bell, the inventor of the telephone) racked up a string of impressive technological advances, including the invention of radio astronomy (1931), solar batteries (1941), and transistors (1947).
Then in 1952, Bell Labs would mark another groundbreaking technological advancement: The AUDREY System, a set of vacuum-tube circuitry housed in a six-foot-high relay rack that could understand numerical digits spoken into its speaker box. When adapted to a specific speaking voice, AUDREY could accurately interpret more than 97% of digits spoken to it. AUDREY is no doubt primitive by today’s standards, but it laid the groundwork for voice-dialing, a technology that was widely used among toll-line operators. (Remember those?)
Ten years later, IBM unveiled its Shoebox machine at the 1962 World’s Fair in Seattle. Like AUDREY, Shoebox recognized the digits 0 through 9, but its vocabulary extended to 16 words in total. When Shoebox heard a number combined with a command word (like “plus” or “total”), it would instruct a linked adding machine to calculate and print the answer to simple arithmetic problems.
Just like that, the world’s first calculator powered by voice recognition was born!
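To get a feel for what Shoebox was doing once the words had been recognized, here is a minimal Python sketch of a spoken-word adding machine. It is an illustration only, not a reconstruction of IBM's circuitry: the function name `shoebox_total` is hypothetical, only the two command words mentioned above ("plus" and "total") are handled, and treating consecutive digits as one multi-digit number is an assumption.

```python
# Hypothetical sketch of a voice-driven adding machine.
# Assumption: the recognizer has already turned audio into a list of words.
DIGITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}


def shoebox_total(words):
    """Add up the numbers in a sequence like ['one', 'two', 'plus', 'three', 'total']."""
    total = 0      # running sum of completed numbers
    current = 0    # number currently being dictated, digit by digit
    for word in words:
        if word in DIGITS:
            # Assumption: consecutive digits build a multi-digit number.
            current = current * 10 + DIGITS[word]
        elif word == "plus":
            total += current
            current = 0
        elif word == "total":
            return total + current
        else:
            raise ValueError(f"unrecognized word: {word!r}")
    raise ValueError("missing 'total' command")


result = shoebox_total("one two plus three total".split())
```

Here "one two plus three total" is read as 12 + 3. The hard part in 1962 was, of course, the recognition step that this sketch takes for granted.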
HARPY takes wing
Voice recognition began to take off as a field in the 1970s, thanks in large part to interest and funding from the U.S. Department of Defense and its research agency, DARPA. Running from 1971 to 1976, DARPA’s Speech Understanding Research (SUR) program was one of the largest research initiatives ever undertaken in the field of voice recognition.
SUR ultimately helped create Carnegie Mellon’s “HARPY” voice recognition system, which was capable of processing and understanding more than 1,000 words. HARPY was particularly significant due to its use of “beam search,” a technique that keeps only the most promising candidate interpretations at each step instead of exploring every possibility. This made it far more efficient for a machine to work out which words had been spoken and to determine the structure of a spoken sentence.
Indeed, advances in voice recognition have often been closely tied to similar strides in search technology. Look no further than Google’s current dominance in both fields for proof positive of this fact.
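For readers curious about how beam search works in practice, the Python sketch below shows the general idea on a toy example. It is not HARPY's actual implementation: the function `beam_search`, the candidate words, and every probability are made up for illustration. At each step the recognizer proposes candidate words with a score, each partial sentence is extended with every candidate, and only the best few partial sentences (the "beam") are kept.

```python
import math


def beam_search(steps, transition, beam_width=2):
    """Return the best word sequences, keeping only `beam_width` candidates per step.

    steps: list of dicts mapping a candidate word to its (toy) acoustic probability.
    transition: function (previous_word_or_None, word) -> probability of that word pair.
    """
    beams = [((), 0.0)]  # each entry is (word sequence, log-probability score)
    for candidates in steps:
        expanded = []
        for sequence, score in beams:
            previous = sequence[-1] if sequence else None
            for word, acoustic_p in candidates.items():
                p = acoustic_p * transition(previous, word)
                if p > 0:
                    expanded.append((sequence + (word,), score + math.log(p)))
        # Prune: keep only the highest-scoring partial sentences.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams


# Toy data (illustrative numbers only).
STEPS = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.5, "beach": 0.5},
]
PAIRS = {
    ("recognize", "speech"): 0.9,
    ("recognize", "beach"): 0.1,
    ("wreck a nice", "speech"): 0.2,
    ("wreck a nice", "beach"): 0.8,
}


def toy_transition(previous, word):
    # The first word has no predecessor, so it gets a neutral weight of 1.0.
    return 1.0 if previous is None else PAIRS.get((previous, word), 0.0)


best_sequence, best_score = beam_search(STEPS, toy_transition, beam_width=2)[0]
```

With these toy numbers the top-ranked sequence is "recognize" followed by "speech". A wider beam explores more alternatives at a higher cost, while a narrow beam is faster but can discard a path that would have won later, which is the trade-off that makes the method practical for large vocabularies.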
From recognition to prediction
By the 1980s, voice recognition tech had begun to advance rapidly, going from simple machines that could understand only dozens or hundreds of spoken words to complex networked machines that could comprehend tens of thousands.
These advances were largely powered by the development of the Hidden Markov Model (HMM), a statistical method that allowed computers to estimate how likely it is that a sound corresponds to a given word, rather than trying to match the sound’s pattern against a rigid template. In this way, HMMs enabled voice recognition machines to greatly expand their vocabulary while also comprehending more conversational speech patterns.
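To make the statistical idea a little more concrete, here is a small Python sketch of the Viterbi algorithm, one standard way to find the most likely sequence of hidden states in an HMM. Everything in the example is an assumption chosen for illustration: the two hidden "sounds" (`s` and `ih`), the two observable labels (`hiss` and `tone`), and all of the probabilities. Real speech systems use far larger models trained on recorded speech.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden state path, probability of that path)."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that gives the most probable path into s.
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            column[s] = (prob, prev)
        best.append(column)

    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path, best[-1][path[-1]][0]


# Toy model (illustrative numbers only).
STATES = ("s", "ih")
START_P = {"s": 0.8, "ih": 0.2}
TRANS_P = {"s": {"s": 0.6, "ih": 0.4}, "ih": {"s": 0.1, "ih": 0.9}}
EMIT_P = {"s": {"hiss": 0.9, "tone": 0.1}, "ih": {"hiss": 0.2, "tone": 0.8}}

path, probability = viterbi(["hiss", "hiss", "tone"], STATES, START_P, TRANS_P, EMIT_P)
```

For the observations "hiss, hiss, tone", the most likely hidden sequence under this toy model is `s`, `s`, `ih`. The key point is that the model weighs probabilities rather than demanding an exact match, which is what let HMM-based recognizers cope with the natural variation in how people speak.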
Armed with this technology, voice recognition began to be adopted for commercial use and became increasingly common in several specialized industries. The 1980s were also when voice recognition began to make its way into home consumer electronics, such as World of Wonder’s 1987 “Julie” doll, which could understand basic phrases and reply back. (“Finally, the doll that understands you!”)
Voice recognition goes mainstream
In 1990, we saw the release of the very first consumer-grade voice recognition product: Dragon Dictate, priced at $9,000 (about $17,000 in 2017 dollars). Its 1997 successor, Dragon NaturallySpeaking, was the first commercial voice recognition program that could understand natural speech at up to 100 words per minute.
1997 also saw the release of BellSouth’s VAL, the very first “voice portal.” VAL was an interactive system that could respond to questions over the phone, laying the groundwork for the same technology powering the voice-activated menus you hear today when calling your bank or ISP.
But after more than 40 years of steady advancement in voice recognition technology, developments in the field stalled out from the mid-1990s through to the late 2000s. At the time, voice recognition programs had hit a ceiling of about 80% accuracy in recognizing spoken words, a limit imposed by the HMM-based approach that underpinned speech technology.
It wasn’t until the end of the 2000s that voice technology began to take off again, and this time in a big way.
Google, Siri, and the voice recognition revolution
Apple’s iPhone had already made waves when it came out in 2007, as tech began to re-orient itself towards an increasingly smartphone-centric and mobile-first future. But with the release of the Google Voice Search app for the iPhone in 2008, voice recognition technology began to once again make major strides.
In many ways, smartphones turned out to be the ideal proving ground for the new wave of voice recognition technology. Voice was simply an easier and more efficient input method on devices with such small screens and keyboards, which incentivized the development of hands-free technology.
But even more significantly, the design principle Google laid down with Voice Search in 2008 continues to define voice recognition technology to this day: the processing power necessary for voice recognition could be offloaded to Google’s cloud data centers. There, Google could store and analyze huge volumes of human speech patterns and accurately match a user’s spoken words against them.
Google’s approach was then perfected by Apple in 2011 with the release of Siri, an AI-driven personal assistant technology that likewise relies on cloud computing to predict what you’re saying. In many ways, Siri is a prime example of Apple doing what it does best: Taking existing technology and applying a mirror-sheen of polish to it. Siri’s easy-to-use interface combined with her sparkling ‘personality’ and Apple’s expert marketing of the iPhone helped make the program nearly ubiquitous.
Voice recognition: The next generation
As more and more users opt for mobile devices instead of desktop computers, voice recognition is becoming increasingly central to our day-to-day lives. Soon, personal assistant apps powered by artificial intelligence will be available on every single laptop, tablet and mobile phone, and they’ll all be able to hold up their end of a convincing conversation.
Instead of typing in a search request, smartphone owners can now simply say “Siri, where can I get good pizza nearby?” And rather than having an ad served through a channel like Google Ads (formerly AdWords), human-sounding personal assistants like Alexa and Cortana will soon be able to integrate the sales pitch into a natural conversation instead.
And beyond entertainment or personal use, there have also been many exciting developments around how voice recognition is used in business and commerce. Voice recognition technology powered by AI can now be used to transcribe phone calls, and even predict the outcome of a conversation based on its tone and the words used.
We may not be able to predict exactly what the field will look like in another 10 years, but one thing is certain: Voice recognition will continue to be at the forefront of exciting new developments in consumer tech and marketing.