There are two main choices here: using ready-to-use Android apps or the Mozilla DeepSpeech library.
Mozilla DeepSpeech is better suited to transcribing hundreds of wav files to text automatically and programmatically.
Using Apps
To use the free apps below, your Android phone must have the Google app and Speech Services by Google installed, since the apps use Google's offline Speech to Text and Text to Speech under the hood.
Download offline speech recognition language for Speech to Text
Depending on your device's OS version, the paths to the Google app settings below can be slightly different. Go there and download the offline languages you need.
Open the Google app, tap your profile icon in the top right corner, and go to Settings > Voice > Offline Speech Recognition
If you do not see Offline Speech Recognition there, try this:
Phone Settings > Google > Settings for Google apps > Search, Assistant, Voice > Voice > Offline Speech Recognition
Tip: if the download doesn’t start, try disabling Airplane mode ✈️.
Here are some free apps that worked for me at the time of this writing.
Mozilla DeepSpeech
Mozilla DeepSpeech is an open-source speech-to-text engine that we can run offline in Termux.
Currently, Termux is not powerful enough to run the desktop model, deepspeech-*-models.pbmm, so we need to use the .tflite model instead.
On Termux:
cd ~/
mkdir -p s2t
cd s2t
# Update the links if you want a version other than 0.9.3
wget -c https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.tflite
wget -c https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
# Replace arm64 if your device has a different CPU architecture
wget -c https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/native_client.arm64.cpu.android.tar.xz
# Extract the archive
tar xf native_client.arm64.cpu.android.tar.xz
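If you are not sure which archive matches your phone, the architecture label can be derived from the output of uname -m. Here is a minimal sketch of that idea. The helper names ds_android_arch and ds_native_client_url are made up for this example, and only the arm64 archive name comes from the commands above; the armv7 and x86_64 labels are assumptions that should be checked against the release page before downloading.

```shell
# Sketch: derive the Android native client URL from the CPU architecture.
# ds_android_arch and ds_native_client_url are hypothetical helper names.

# Map `uname -m` output to the label used in the archive file name.
# Only arm64 is confirmed by the commands above; the armv7 and x86_64
# labels are assumptions to verify on the release page.
ds_android_arch() {
  case "$1" in
    aarch64|arm64) echo "arm64" ;;
    armv7l|armv8l) echo "armv7" ;;
    x86_64) echo "x86_64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the download URL for a release version ($1) and architecture label ($2).
ds_native_client_url() {
  echo "https://github.com/mozilla/DeepSpeech/releases/download/v$1/native_client.$2.cpu.android.tar.xz"
}

# Print the URL for this device; pass it to `wget -c` to download.
if arch="$(ds_android_arch "$(uname -m)")"; then
  ds_native_client_url 0.9.3 "$arch"
fi
```

On a typical 64-bit ARM phone, uname -m reports aarch64, which maps to the same arm64 archive used above.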
So the s2t folder tree will be something like this:
➜ s2t tree
.
├── GRAPH_VERSION -> training/deepspeech_training/GRAPH_VERSION
├── LICENSE
├── README.mozilla
├── VERSION -> training/deepspeech_training/VERSION
├── deepspeech
├── deepspeech-0.9.3-models.pbmm
├── deepspeech-0.9.3-models.scorer
├── deepspeech-0.9.3-models.tflite
├── deepspeech.h
├── generate_scorer_package
├── libc++_shared.so
├── libdeepspeech.so
├── native_client.arm64.cpu.android.tar.xz
└── s2t.sh
To run it, we have to temporarily export LD_LIBRARY_PATH=~/s2t so that the deepspeech binary can find libdeepspeech.so and libc++_shared.so.
This export can cause other programs in the same session to stop working. But don’t worry, simply close and restart the Terminal or Termux to remove this temporary export.
Below is a one-line command to get the text from file.wav.
export LD_LIBRARY_PATH=~/s2t/ && ~/s2t/deepspeech --model ~/s2t/deepspeech-0.9.3-models.tflite --scorer ~/s2t/deepspeech-0.9.3-models.scorer --audio file.wav
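One way to avoid breaking other programs is to skip the export and set LD_LIBRARY_PATH for the deepspeech command only. In a POSIX shell, a VAR=value prefix in front of an external command applies to that single command and leaves the rest of the session alone. The sketch below wraps this in a small function. The names s2t and S2T_DIR are my own for this example (they are not part of DeepSpeech), and the function assumes the files sit in ~/s2t as described above.

```shell
# Sketch: run deepspeech with LD_LIBRARY_PATH set for that one command only.
# s2t and S2T_DIR are hypothetical names; the file names come from the steps above.
s2t() {
  dir="${S2T_DIR:-$HOME/s2t}"
  # The VAR=value prefix applies to this single command; the shell's own
  # environment is left untouched.
  LD_LIBRARY_PATH="$dir" "$dir/deepspeech" \
    --model "$dir/deepspeech-0.9.3-models.tflite" \
    --scorer "$dir/deepspeech-0.9.3-models.scorer" \
    --audio "$1"
}

# Usage: s2t file.wav
```

Because nothing is exported, there is no need to restart Termux afterwards.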
Note that each audio file should be only around 8 seconds long. This is a current drawback of DeepSpeech. You can use ffmpeg to split a longer recording into segments.
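As a rough sketch of the splitting step, ffmpeg's segment muxer can cut a recording into fixed-length chunks. The helper names chunk_pattern and split_wav are made up for this example. The conversion to 16 kHz mono 16-bit PCM is an assumption about what the 0.9.3 English model expects, so adjust it if your model differs. Fixed-length cuts can also land in the middle of a word, so treat the result as a starting point.

```shell
# Sketch: split a long recording into short chunks for DeepSpeech.
# chunk_pattern and split_wav are hypothetical helper names.

# Turn "talk.wav" into the numbered output pattern "talk_%03d.wav".
chunk_pattern() {
  printf '%s_%%03d.wav\n' "${1%.*}"
}

# Split $1 into chunks of $2 seconds (default 8), converting to
# 16 kHz mono 16-bit PCM along the way (assumed model input format).
split_wav() {
  ffmpeg -nostdin -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le \
    -f segment -segment_time "${2:-8}" "$(chunk_pattern "$1")"
}

# Usage: split_wav talk.wav      writes talk_000.wav, talk_001.wav, ...
```

Each resulting chunk can then be passed to the deepspeech command shown earlier in place of file.wav.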