Tech Background: Babbel Speech Recognition
Interview with Technical Director Thomas Holl
Speech recognition is the exciting new feature at Babbel. It’s not only fun – it’s also amazingly efficient for learning a new language. But how does it work? I got the lowdown from our Technical Director Thomas.
Crisi: What does the new speech recognition tool do?
Thomas: Basically, we use pronunciation samples recorded by our native speaking course editors and compare your pronunciation to theirs. As always with Babbel, you get instant feedback. The closer your pronunciation is to this example, the more points you get on a scale from 0 to 100. If you get more than 50 points, you’re good enough to be generally understood.
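As an illustration of the scoring Thomas describes, here is a minimal sketch of how a "distance" between the learner's recording and the reference could be turned into a 0 to 100 score. Only the 0 to 100 scale and the 50-point threshold come from the interview. The exponential mapping, the `scale` parameter and the function names are assumptions made for this example, not Babbel's actual formula.

```python
import math

# From the interview: more than 50 points means "generally understood".
PASS_THRESHOLD = 50


def distance_to_score(distance: float, scale: float = 1.0) -> int:
    """Map a non-negative distance to a score from 0 to 100.

    A distance of 0 (identical to the reference) gives 100 points, and the
    score falls towards 0 as the distance grows. The exponential curve and
    the `scale` parameter are hypothetical choices for this sketch.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return round(100 * math.exp(-distance / scale))


def is_understandable(score: int) -> bool:
    """True if the score is above the 50-point threshold."""
    return score > PASS_THRESHOLD
```

Any curve that falls steadily from 100 towards 0 would serve here. The exponential is simply a compact choice that never leaves the 0 to 100 range.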
Crisi: But if you just compare two sounds, is that really speech recognition?
Thomas: Sure, we recognize what you say. We’re now sitting in front of the screen and we are talking but you see that the score is 0 all the time. Now, try saying arrivederci.
Thomas: Nice, 78 points. Better than Aldo Raine in “Inglourious Basterds” (see details here). Remember the hilarious scene where Brad Pitt is trying to speak Italian? We ran his pronunciation through our analysis and as you might expect he scored pretty low. But I’m digressing, sorry. Back to our little test. Your pronunciation is about a 78% match with our reference sample. That’s pretty good.
Crisi: Still, it’s only about comparing sounds, not about understanding what I say.
Thomas: Well, there are different sub-types of speech recognition. One is speech-to-text or voice control. That’s what you’d use to enter text or commands if you can’t use a keyboard. Recognizing words and evaluating their pronunciation is another sub-type, and that’s the technology that makes sense for language learning. We can use it for pronunciation training and for building new interactive exercises.
Crisi: So, what’s the technical challenge in this sub-type of speech recognition?
Thomas: Well, it’s not as easy as it sounds – no pun intended. It’s actually not enough to just compare two sounds. It’s a little like telling how similar two people look in two different photos. The audio samples are usually pretty different: a woman has a higher voice than a man and the tempo of speech also differs a lot. And then you have a number of artifacts…
Thomas: Noises and characteristics that are caused by the environment or the technical setup: rumbling, hissing, other sounds mixing into the voice. Most people don’t have a high-end microphone connected to their computer and in our case we just use the built-in mic on my laptop. The audio quality of what the system is hearing is pretty poor.
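The tempo problem Thomas mentions is commonly handled with dynamic time warping (DTW), which stretches or compresses one sequence in time before comparing it to another. The interview does not say which algorithm Babbel uses, so the sketch below is only an illustration of the general idea. It works on plain lists of numbers, where a real system would typically compare spectral features extracted from the audio, and the function name is made up for this example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Each element of `a` is aligned with one or more elements of `b` (and
    vice versa) so that the summed absolute difference is minimal. The
    result is divided by len(a) + len(b) so that long and short recordings
    give comparable values.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a
    # with the first j items of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its item
                cost[i][j - 1],      # b advances, a repeats its item
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m] / (n + m)
```

With this measure, a contour spoken at half speed, such as `[0, 0, 1, 1, 2, 2]` against `[0, 1, 2]`, has distance zero, while a genuinely different contour does not.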
Crisi: So to make the speech recognition work properly, our users need to have a good mic and be in a quiet room?
Thomas: No, that’s the point: we can also work with cheap microphones and filter out noise in the immediate environment. That’s part of the challenge.
Crisi: Sounds like a lot of filtering and levelling…
Thomas: Yes, that also, but there’s more: We have to distil the “core” of the voice sample and then match that to the original. To do that, the system needs to figure out when you start and stop speaking. You don’t have to press any key to start and stop recording; we do the matching in real-time.
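Working out when someone starts and stops speaking is often done by watching the signal energy in short frames. The following is a minimal sketch of that idea, not a description of Babbel's implementation. The frame size, the threshold, the minimum run length and all the names are assumptions chosen for the example.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def find_speech_bounds(samples, frame_size=160, threshold=0.02, min_frames=3):
    """Return (start, end) sample indices of the first stretch of speech.

    The signal is cut into frames of `frame_size` samples. A frame counts
    as "loud" if its RMS level exceeds `threshold`. The first run of at
    least `min_frames` consecutive loud frames is reported; shorter runs
    (clicks, bumps) are ignored. Returns None if no such run exists.
    """
    n_frames = len(samples) // frame_size
    run_start = None

    # One extra iteration (k == n_frames) closes a run that lasts to the end.
    for k in range(n_frames + 1):
        loud = False
        if k < n_frames:
            frame = samples[k * frame_size:(k + 1) * frame_size]
            loud = frame_rms(frame) > threshold

        if loud:
            if run_start is None:
                run_start = k
        else:
            if run_start is not None and k - run_start >= min_frames:
                return run_start * frame_size, k * frame_size
            run_start = None
    return None
```

Requiring several loud frames in a row is a simple way to keep a single click or keyboard tap from being mistaken for the start of a word.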
Crisi: So everything we say into the system here is somehow analyzed?
Thomas: Right. Just look at the level: every sound input is analyzed and matched to the sound we’re looking for. In this case, arrivederci.
Crisi: 55 points.
Thomas: Ok, yours is better than mine. But you see that the word was recognized among all the other things we said.
Crisi: Is this unique technology? Are there other software products that do this?
Thomas: There are a number of software products that do have speech recognition. Some of them are also of decent quality.
Crisi: So what’s so special about the Babbel speech recognition?
Thomas: Well, it’s online and works in your browser.
Crisi: Does this mean that everything we say here is sent to the Babbel servers and analyzed there?
Thomas: No, the whole audio processing is done instantly, directly in the browser. We don’t have to send the audio to the server and that’s why we can give instant feedback.
Crisi: Do I have to install a plugin or something?
Thomas: You don’t. It’s all done in Flash. 97% of all browsers have the Flash plugin pre-installed. As we use the latest version, you might have to do an update, but that’s very quick. Other than that, you just need a microphone like the one that’s built into my laptop.
Crisi: Babbel has been online since January 2008. Why did it take so long to add this feature?
Thomas: We needed the new Flash Player 10.1 because before that it wasn’t possible to do audio processing locally. It would have been necessary to either send all the audio to the server for analysis or to use a custom browser plugin.
Crisi: What’s wrong with a custom browser plugin?
Thomas: First of all, you have to install new software on your computer. And then you have compatibility issues. There are some rare solutions that offer real-time speech recognition in a browser plugin, but most of them won’t work on your Mac and none of them are compatible with all browsers. Flash is already there, the plugin works fine and it’s available for all platforms.
Crisi: How about the iPhone? You can’t use Flash technology on that platform, can you?
Thomas: No, but the Babbel iPhone apps work natively on the iPhone anyway.
Thomas: The Babbel apps are built specifically for the iPhone and don’t need a browser or plugin to work. That’s called a “native” application. We can build our algorithm directly into the app.
Crisi: That’s not related to Native Instruments, the software company you used to work for?
Thomas (laughs): No, not directly. But for being an audio software company, Native Instruments definitely is a great name because the software works natively on the computer.
Crisi: I guess we don’t have to understand that completely. But speaking of audio software: has your audio expertise (along with that of the other Babbel founders) been crucial for this new feature or is it something entirely different than building DJ tools?
Thomas: Both. Of course working on beat detection and time stretching for music and building a speech recognition tool are two different things. On the other hand, we couldn’t have done this in-house without our background.
Crisi: So who actually implemented the new feature?
Thomas: Most of it was done by Toine Diepstraten, one of the Babbel founders. He and I started working together on audio software in our first company, d-lusion, more than 10 years ago. Toine is one of the best developers and audio specialists I’ve ever met. It’s fantastic to have him on board for this project. He did have to do quite a bit of research, but without his expertise this would never have been possible. This way we have state-of-the-art technology that can compare with any other implementation.
Crisi: You sound very convinced.
Thomas: From a technical point of view, this is a great piece of software. We actually got some recognition from Adobe, the makers of the Flash Player. They were pretty impressed by our solution.
Crisi: Will this be a focus for Babbel from now on, or do you plan to work on other types of features?
Thomas: It is a very important feature because now we can do everything online that traditional e-learning software can do locally. And we don’t need installation or updates and we have a very lively online community that goes together with the self-directed learning…
Thomas: It’s important but it’s not the end. We’ll keep working and adding new features.
Crisi: Can you say what’s next for Babbel?
Thomas: Sorry, but for that we’ll have to turn off the mic.
Crisi: No problem.