My big personal challenge here at Tryolabs is bringing the power of AI into the hands of users in a gorgeous and comfortable way. Part of that challenge is exploring ways to apply known machine learning solutions on the mobile side.
This time my shot was ambitious:
“Performing decent Automatic Speech Recognition (ASR) with on-device processing.”
There are many papers and books that study how an ASR solution can be optimized. Those approaches include hardcore code-level optimizations plus all kinds of statistical model tweaks at different levels (acoustic model, language model, etc.). Some of them target mobile devices, and all of them were written by people far cleverer than me, so my intention is not to revolutionize the field. Instead, my idea is to use existing tools to get decent performance without a server-side component.
The first thing that appears when you start scratching the speech-recognition-on-mobile-devices surface is OpenEars. It’s a complete offline ASR solution with many functionalities, and it’s really easy to use. You can get a content-independent grammar recognizer (e.g. voice commands) up and running in minutes. It also provides a set of plugins that extend Pocketsphinx, which you can acquire.
OpenEars is based on CMUSphinx, the speech recognition engine that is part of a bigger project, Speech at CMU, from Carnegie Mellon University. This engine includes Pocketsphinx, a lightweight recognizer library written in C and conceived for environments like mobile devices.
With Pocketsphinx available, the only thing left to do was try it on the device and see how it goes. After a hard day of configure; make; runtest I ended up with a functional library that runs on desktop, simulators, and devices. My first tests with my own voice weren’t so encouraging: it seems the model shipped with Pocketsphinx isn’t adapted to my speaking style, so the results were not so good.
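That day of fiddling boiled down to the usual autotools routine, first for SphinxBase and then for Pocketsphinx. A rough sketch of the steps (the flags for cross-compiling to iOS are omitted, and the exact targets may differ on your checkout):

```shell
# SphinxBase must be built and installed first; Pocketsphinx depends on it.
cd sphinxbase
./autogen.sh && ./configure && make && make install

cd ../pocketsphinx
./autogen.sh && ./configure && make && make install

# Sanity-check the decoder against the bundled test audio.
make check
```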
Anyway, I thought it would be nice to have a less-pure-C way to use Pocketsphinx (there is already an Objective-C way with OpenEars), so I decided to create TLSphinx: a Swift framework that wraps Pocketsphinx in a neat way.
To build TLSphinx, I first needed to build a Clang module for Pocketsphinx (and SphinxBase, the core of the engine). So I wrote a module.modulemap by hand and told Xcode where to find it. Of course Xcode finds it, compiles it, and links it, but don’t expect more than that, particularly with Swift on the other side. Once I got the Pocketsphinx module working, I hit the next stone. Bridging Objective-C with C is almost trivial; with Swift it’s a completely different story, and poorly documented too. You must use Swift types to describe C declarations, which is nice once you hit the right types for your C expressions. This was hard for me because Xcode didn’t give me any feedback, so I needed to check Pocketsphinx’s headers and translate the types from there. Thanks again, Xcode :-|
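For reference, the module map itself is tiny. A minimal sketch of what mine looked like (header and library names are illustrative; the real one also covers SphinxBase):

```
module Pocketsphinx {
    header "pocketsphinx.h"
    link "pocketsphinx"
    export *
}
```

The `header` line points Clang at the C header to parse, `link` names the library to link against, and `export *` re-exports everything the header declares so Swift can see it after an `import Pocketsphinx`.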
Now, with all the pieces in place, let’s see what TLSphinx can do! It’s a really simple framework with two main functions: decode speech from a file and decode speech from the mic. It also provides a Config class that supports all the parameters of the underlying cmd_ln_t opaque structure of Pocketsphinx, and a Hypotesis class that represents the decode result.
Here is an example of decoding a file stored on the device.
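A sketch of what that looks like, in the Swift of the time (the model and audio paths are illustrative; goforward.raw is the test recording bundled with Pocketsphinx):

```swift
import TLSphinx

// Paths to the acoustic model, language model and dictionary that ship
// with Pocketsphinx (adjust to wherever they live in your bundle).
let hmm  = NSBundle.mainBundle().pathForResource("en-us", ofType: nil)!
let lm   = NSBundle.mainBundle().pathForResource("en-us.lm", ofType: "bin")!
let dict = NSBundle.mainBundle().pathForResource("cmudict-en-us", ofType: "dict")!

if let config = Config(args: ("-hmm", hmm), ("-lm", lm), ("-dict", dict)) {
    if let decoder = Decoder(config: config) {
        let audioFile = NSBundle.mainBundle().pathForResource("goforward", ofType: "raw")!
        decoder.decodeSpeechAtPath(audioFile) { hypotesis in
            if let hyp = hypotesis {
                // hyp.text holds the decoded string, hyp.score its confidence
                println("Text: \(hyp.text) - Score: \(hyp.score)")
            }
        }
    }
}
```

The Config is built from the same flag/value pairs you would pass to the pocketsphinx command line, which is what makes every cmd_ln_t parameter reachable from Swift.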
This will return the text “go forward ten meters”, decoded from the audio file shipped with Pocketsphinx for testing. Decoding from the mic is almost the same, but instead of decodeSpeechAtPath() we use startDecodingSpeech().
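A sketch of the live-decoding variant, reusing the same Config setup as above (stopDecodingSpeech() is my assumption for the symmetric call that ends the session):

```swift
import TLSphinx

if let config = Config(args: ("-hmm", hmm), ("-lm", lm), ("-dict", dict)) {
    if let decoder = Decoder(config: config) {
        // The closure fires with a hypothesis for each utterance heard.
        decoder.startDecodingSpeech { hypotesis in
            if let hyp = hypotesis {
                println("Heard: \(hyp.text)")
            }
        }
        // ... and later, when you are done listening:
        // decoder.stopDecodingSpeech()
    }
}
```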
The callback passed to startDecodingSpeech() is called for each decoded utterance.
So what do we have?
With this simple API you can get a nice first approach to ASR in Swift using the Pocketsphinx engine. The possibilities go far beyond what we explored here. My idea is to keep adding features written in Swift to TLSphinx that build on Pocketsphinx.
Visit the repo to check how TLSphinx works and how to integrate it into your project. Any kind of contribution is an honor.
Let’s see how far we get :)