Speech technology in the era of Big Data

Public interest in speech technology has recently increased dramatically, following the demonstration of Google Duplex, which elicited a wide range of reactions. Most observers were stunned at the apparent sophistication and power of the application, which can book appointments over the telephone by “conducting natural conversations” (to quote the Google blog that accompanied the public release of Duplex). However, Duplex also generated substantial controversy, with criticisms ranging from concerns about ethics (is Duplex misleading people into thinking that they are talking to a human?) to privacy (how much consent is required from the companies that are called by Duplex?), all the way to feasibility (were the Duplex demonstrations faked, or at least edited to a substantial degree?).

These are interesting issues, and they should also raise general awareness that speech technology is continually improving, along with many other forms of pattern recognition. Both automatic speech recognition (ASR) and speech synthesis (also known as “text-to-speech”, or TTS) have benefited greatly from breakthroughs in Deep Learning during the past decade. These improvements have advanced the capabilities of applications such as voice-based Web search and on-demand audio content generation far beyond the levels achievable five years ago.

However, it is also important to understand the limitations that remain in place despite such advances. Although the most advanced ASR systems are able to recognize certain spoken utterances with human-level accuracy, and the best TTS systems sound every bit as natural as a (human) voice artist under appropriate circumstances, these algorithmic listeners and talkers still lack anything like the understanding that characterizes human intelligence. As a consequence, ASR and TTS systems fail in ways that are deeply unintuitive to anybody who is not intimately familiar with the technology. (In fact, even those of us who have been working on ASR and TTS for decades are sometimes surprised at their failures!)

Such failures make for funny YouTube videos, but they can be disastrous in customer-facing applications. The first big wave of speech-technology deployments, in the late 1990s, was only partially successful, since many consumers found it too confusing to interact with such unpredictable systems. Corporations that had invested millions of dollars in speech technology were sometimes surprised to find that withdrawing their newly installed systems was a crucial step towards customer satisfaction! Modern ASR and TTS are much better than the technologies of that era, but it is still not clear how widely they can succeed in automating dialogues with customers, where any mistake by the technology translates directly into an unhappy customer.

Another major trend of this decade has, however, created a huge range of additional applications for ASR: in this era of Big Data, we now realize that there is great commercial and social value in the analysis of aggregated data sources. For numerical data, such as consumer purchasing, movement or communication patterns, Big Data algorithms have been spectacularly successful and underpin the operations of companies such as Amazon and Uber. Textual data has likewise been extensively utilized in the modern economy. Speech data, by contrast, has been harder to use in the same fashion. Many companies possess potential treasure troves of spoken information (e.g., recordings collected in their call centres) that could provide crucial business intelligence but are currently not accessible for analysis. This is a perfect application domain for fallible speech technology, since occasional recognition mistakes will not obscure the significant trends that are the main aim of analytics for business intelligence.
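
As a toy illustration of this point, the Python sketch below aggregates keyword counts over a batch of call transcripts; the `transcribe` function is a hypothetical placeholder for whichever ASR engine is actually used, and the keywords are invented for the example. Even if individual transcripts contain recognition errors, counts aggregated over thousands of calls can still reveal the underlying trends.

```python
from collections import Counter

def transcribe(audio_path):
    """Hypothetical placeholder for an ASR engine: returns a (possibly
    imperfect) text transcript of one recorded call."""
    raise NotImplementedError("plug in the ASR system of your choice")

def keyword_trend(audio_paths, keywords):
    """Count how often each keyword of interest appears across a batch
    of calls. Occasional recognition errors in individual transcripts
    have little effect on counts aggregated over many thousands of calls."""
    counts = Counter()
    for path in audio_paths:
        words = transcribe(path).lower().split()
        for kw in keywords:
            counts[kw] += words.count(kw)
    return counts

# Example usage (paths and keywords are illustrative):
# trend = keyword_trend(march_call_recordings, ["refund", "cancel", "outage"])
```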

It will be interesting to see what the future holds for systems such as Google Duplex, and also for less ambitious attempts to perform live communication with end users through speech technology. However, the most impactful applications of ASR right now are likely to operate behind the scenes, and to make speech information a seamless component of Big Data – that is, speech analytics.

Is this the decade of Deep Learning?

Most of us have encountered the excitement surrounding Artificial Intelligence (AI) and Machine Learning during the past two or three years. In the popular press, business conferences and Gartner reports we are frequently reminded that AI has progressed by leaps and bounds in the recent past. In particular, a method known as Deep Learning has been used to develop systems that perform surprisingly well in a wide range of tasks. These successes include image recognition (for instance, tagging Facebook pictures), machine translation (most prominently used in Google Translate) and self-driving cars, which are being developed at various companies and universities. More recently, systems using Deep Learning have achieved notable victories in board games, with a program known as AlphaGo beating one of the top players in the ancient board game Go, and a related program (AlphaZero) performing at superhuman levels in chess.

Extrapolating from these unprecedented achievements, many technology watchers have predicted that Deep Learning is poised to transform AI, business and society in rapid succession. Gartner, for example, has predicted that AI will be the core element in three of the top ten technology trends for 2018, with Deep Learning being an implicit component in each of those predicted advances: “AI Foundation”, “Intelligent Apps and Analytics” and “Intelligent Things”. (At least two of the other “technology trends” foreseen by Gartner are also likely to require Deep Learning, namely “Conversational Platforms” and “Continuous Adaptive Risk and Trust”.) In the same vein, publications such as the New York Times and The Economist have published several articles in recent months on the looming importance of AI in areas ranging from poverty alleviation to consumer electronics. Beyond these prospects lie visions of superhuman intelligence solving the world’s problems and, eventually, of the technological singularity that will dramatically alter the entire human project.

Of course, predictions – especially about the future – are tricky, and a number of issues need to be considered when the potential of Deep Learning is assessed. In this blog, I will briefly mention three of the current controversies that are most relevant to the future of Deep Learning, and in future blogs I will explore each of these topics in more detail.

One area of widespread concern amongst the experts is our very limited theoretical understanding of Deep Learning. Our current Deep-Learning algorithms have evolved from concepts that were developed in the 1980s; those algorithms were themselves not thoroughly understood, and the dramatic improvements of the past five years were mostly achieved by trial and error. Consequently, we do not have a good model for explaining why Deep Learning performs so well, or how to make systematic improvements to its capabilities. One popular theoretical model from the past attributed good generalization to describing a large data set with relatively few parameters (the so-called “Occam’s razor” principle). However, it is becoming increasingly clear that this is not a sufficient basis for understanding Deep Learning: different Deep Neural Networks with the same number of parameters but different network structures differ systematically in their performance. Several proposals for theoretical models have recently appeared in the Deep-Learning literature, but it is probably safe to say that these have not fully explained why Deep Learning succeeds so well – they certainly have not influenced the practical development of systems to a significant extent. Our understanding of issues such as generalization and learning algorithms is still highly immature.
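
To make the parameter-counting point concrete, here is a minimal sketch (assuming PyTorch is available; the layer sizes are arbitrary) that builds two networks with roughly the same number of parameters but very different structures: one wide and shallow, one deep and narrow. Pairs like these can generalize quite differently in practice, which is precisely what a pure “Occam’s razor” argument based on parameter counts cannot explain.

```python
import torch.nn as nn

def count_params(model):
    return sum(p.numel() for p in model.parameters())

# Wide, shallow network: a single large hidden layer.
wide = nn.Sequential(nn.Linear(784, 520), nn.ReLU(), nn.Linear(520, 10))

# Deep, narrow network: several smaller hidden layers, sized so that the
# total parameter count is roughly the same as for the wide network.
deep = nn.Sequential(
    nn.Linear(784, 400), nn.ReLU(),
    nn.Linear(400, 170), nn.ReLU(),
    nn.Linear(170, 170), nn.ReLU(),
    nn.Linear(170, 10),
)

# The two counts agree to within a fraction of a percent, yet the two
# architectures can behave quite differently when trained.
print(count_params(wide), count_params(deep))
```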

On a somewhat related note, there is also much skepticism in the literature about how far the current batch of approaches will take us. Although Deep Learning has been applied in a wide range of domains, those applications share many similarities: in each case, the task can be represented as a mapping between an input of fixed dimensionality and an output of fixed dimensionality; these mappings are adequately represented by a set of “training examples”; many such examples are available for the development of the system; and so on. Biological intelligence, on the other hand – including human intelligence – operates under a much wider range of conditions. We generate our own “training examples” from a continuous stream of experience, and solve many problems which are not naturally cast as input-output mappings of the type required for Deep Learning. A variety of extensions to Deep Learning have been proposed in order to address these differences, but many informed critics remain doubtful that anything like human intelligence can be achieved in this manner.
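
As a rough illustration of this framing (the array shapes and the linear model are purely illustrative), the sketch below shows the fixed-dimensional, example-driven template that current Deep Learning assumes; tasks such as open-ended dialogue or lifelong learning from a continuous stream of experience do not reduce naturally to such (input, output) pairs.

```python
import numpy as np

# The standard Deep-Learning template: N training examples, each pairing a
# fixed-size input vector with a fixed-size output vector.
N, D_IN, D_OUT = 10_000, 128, 10
X = np.random.randn(N, D_IN)           # inputs:  always shape (N, 128)
Y = X @ np.random.randn(D_IN, D_OUT)   # targets: always shape (N, 10)

# A learner in this framing is simply a function fitted to the X -> Y mapping
# (here the simplest possible one, a linear least-squares fit).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Problems without a natural (X, Y) representation, such as deciding what to
# learn next from an unbounded stream of experience, do not fit this template.
```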

Finally, there are good reasons to wonder whether it is even desirable for Deep Learning to achieve anything like human-level intelligence. One type of concern is captured by Stephen Hawking’s warning: “I think the development of full artificial intelligence could spell the end of the human race.” Others worry that vastly improved AI will exacerbate inequality and unemployment – if most human occupations can be performed by intelligent algorithms, the owners of the systems executing those algorithms will become extremely wealthy, whereas most people will no longer be able to support themselves through salaried employment.

Although both of these concerns seem way overblown given the limitations of Deep Learning mentioned above, it would be foolish not to think about them at an early stage. One clear lesson from the past decade is that the capabilities of algorithms can improve very rapidly, and even several decades may be too short a period to prepare for changes of such magnitude.

All these unknowns are signs of a field in rapid transition, and in future blogs I will take a more detailed look at each of the above issues.