OpenAI open-sources Whisper, a multilingual speech recognition system
Speech recognition remains a difficult problem in AI and machine learning. As a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system that the company claims enables “robust” transcription in multiple languages, as well as translation from those languages into English.
Countless organizations have developed speech recognition systems, which sit at the heart of software and services from tech giants like Google, Amazon, and Meta. But what makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and “multitask” data collected from the web, which led to improved recognition of distinctive accents, background noise, and technical jargon.
“AI researchers looking into the robustness, generalization, capabilities, biases, and constraints of the current model are the main intended users of [the Whisper] models,” OpenAI writes in the Whisper GitHub repository, where several versions of the system can be downloaded. “However, Whisper also has the potential to be quite helpful for developers as an automatic speech recognition solution, particularly for English speech recognition ... In about 10 languages, [the models] produce strong ASR results. They haven’t been rigorously evaluated in these areas, but they may show additional capabilities if tuned for certain tasks like voice activity detection, speaker classification, or speaker diarization.”
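For developers curious what using the released models looks like, the repository exposes a small Python API. The sketch below assumes the `openai-whisper` package is installed (`pip install -U openai-whisper`) and uses a placeholder audio path; the `srt_timestamp` helper is our own illustrative addition, not part of the library.

```python
# Minimal sketch of transcribing a file with the open-source `whisper` package.
# "audio.mp3" is a placeholder path; srt_timestamp is a hypothetical helper.

def srt_timestamp(seconds: float) -> str:
    """Format a segment offset in seconds as an SRT-style timestamp."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

try:
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model("base")      # smaller checkpoints trade accuracy for speed
    result = model.transcribe("audio.mp3")  # placeholder input file
    print(result["text"])                   # full transcript
    for seg in result["segments"]:          # per-segment start times and text
        print(srt_timestamp(seg["start"]), seg["text"].strip())
except ImportError:
    pass  # whisper not installed; the sketch still shows the API shape
```

Larger checkpoints (e.g. `"medium"`, `"large"`) generally improve accuracy at the cost of speed and memory, which matters for the near-real-time use cases discussed below.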
Whisper has flaws, particularly in the area of text prediction. Because the system was trained on a large amount of “noisy” data, OpenAI warns, Whisper may include words in its transcriptions that weren’t actually spoken, possibly because it is simultaneously trying to predict the next word in the audio and to transcribe the audio itself. Whisper also doesn’t perform equally well across languages, exhibiting a higher error rate for speakers of languages underrepresented in the training data.
Unfortunately, that last problem is hardly unique to Whisper; bias has long plagued the field of speech recognition. Even the best systems exhibit it: a 2020 Stanford study found that systems from Amazon, Apple, Google, IBM, and Microsoft made significantly fewer errors (about 35% fewer) with white users than with Black users. Despite this, OpenAI anticipates that Whisper’s transcription capabilities will be used to improve existing accessibility tools.
While Whisper models can’t be used for real-time transcription out of the box, the company notes on GitHub that their speed and size suggest others may be able to build applications on top of them that enable near-real-time speech recognition and translation. Given the real value of useful applications built on top of Whisper, OpenAI cautions, the disparate performance of these models may have real economic implications. And while the company hopes the technology will be used primarily for good, it acknowledges that making automatic speech recognition more widely available could enable more actors to build capable surveillance technologies or expand existing surveillance efforts, since the models’ speed and accuracy make automatic transcription and translation of large volumes of audio communication feasible and affordable.
Whisper’s debut doesn’t necessarily signal what OpenAI has in store for the future. While the company increasingly concentrates on commercial projects like DALL-E 2 and GPT-3, it is also pursuing several purely theoretical research threads, including AI systems that learn by watching videos.