Introduction to Speech Recognition with Raspberry Pi + USB Microphone for Industrial Applications: Julius and VOSK

Speech recognition, realized by connecting a microphone to a Raspberry Pi, is effective for some applications and not too expensive. If you only need speech recognition, as opposed to spoken dialogue, you don’t need high machine specs.
There is a lot of speech recognition software, and many services and libraries, both paid and free. API services from Google, Amazon, and IBM are especially common.

Since we wanted to run this project in a local (offline) environment, we considered “Julius,” which is highly efficient and versatile, and “VOSK,” which is easy to handle from a Python program. Both are free and easy to get started with. Because everything runs locally, there is no need to worry about security or privacy.
*This article covers Japanese speech recognition.

The Raspberry Pi’s specs cannot handle just any speech recognition software. If the Japanese speech recognition model is too large, real-time processing will not be able to keep up.
I tried Julius with the basic dictation kit and VOSK with the small recognition model.

*The Python sample programs in this article were created with ChatGPT. They are included to illustrate the operation tests. Please understand that their error handling and termination handling are incomplete.

Test environment

The test machine used this time is the PL-R5M, a model built around the Raspberry Pi Compute Module 5.

Equipment used:

  • PL-R5M (Raspberry Pi CM5)
  • Jabra SPEAK 510 (USB microphone)

Environment used:

  • Raspberry Pi OS (bookworm) 64bit
  • Python 3.11.2
  • pip 23.0.1

Checking the USB microphone

The USB-connected microphone used in this project is the Jabra SPEAK 510, a combined microphone and speaker intended for use by multiple people, such as in meetings.
There are also USB devices that are simply microphones, most of which can be used just by connecting a USB cable.

You can see that it becomes selectable on the Raspberry Pi OS desktop just by connecting it.

However, it is more reliable to specify the microphone device explicitly when using it from a Python program, so first look up the index number used to specify it.
Sometimes there is no problem even without specifying an index number, as long as the default device is correct.
In this case, I looked up and set up a device that serves as both microphone and speaker.

Card number of the USB microphone:
On Raspberry Pi OS, you can look it up with the arecord command.
The Jabra SPEAK 510 I connected this time had card number 2 and device number 0. As an ALSA device name, that is hw:2,0 or plughw:2,0.

arecord -l

**** List of CAPTURE Hardware Devices ****
card 2: USB [Jabra SPEAK 510 USB], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

The microphone cannot be captured without specifying this card number, so look it up once the microphone is connected.

Installing Julius


I’ll start with Julius.
Julius is well known, but much of the information about it is dated and the official site has broken links, so I had a hard time installing it, especially on the Raspberry Pi. Documentation is available, but you may well stumble at the installation stage.

Reading the installation instructions on the official website, I found that ALSA must be specified explicitly when building for Raspberry Pi OS.
Also, the config.guess and config.sub files used by configure were out of date (2008) and did not recognize the Raspberry Pi 5 series (aarch64/ARM64), so new files had to be downloaded.

Since there is no make uninstall target, the installation destination is changed to a local directory ($HOME/.local) that does not require sudo privileges, making removal easy later.

Work in the following sequence.
I obtained the source with git clone, but it is the same if you download and extract the source archive; note that the directory name will differ. (wget https://github.com/julius-speech/julius/archive/v4.6.tar.gz)

sudo apt update
sudo apt install build-essential zlib1g-dev libsdl2-dev libasound2-dev

git clone https://github.com/julius-speech/julius.git

cd julius
wget -O config.guess 'https://git.savannah.gnu.org/cgit/config.git/plain/config.guess'
wget -O config.sub   'https://git.savannah.gnu.org/cgit/config.git/plain/config.sub'
chmod +x config.guess config.sub

cp config.guess support/config.guess
cp config.sub   support/config.sub

./configure --build=aarch64-unknown-linux-gnu --with-mictype=alsa --prefix=$HOME/.local
make
make install

export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH
export PATH=$HOME/.local/bin:$PATH

config.guess and config.sub are given execute permission and then copied into the support directory.

The options specified to configure have the following meanings:

# Specify Raspberry Pi aarch64
--build=aarch64-unknown-linux-gnu

# Specify ALSA
--with-mictype=alsa

# By specifying a local directory, sudo is not required
--prefix=$HOME/.local

The two export lines at the end are needed so that the shell can find the Julius binaries and libraries installed under the local directory.

With the above commands, configure passed successfully.

Finally, check the Julius version to confirm that the installation succeeded and the path is set:

julius -version

If the version is displayed, everything is OK.

Julius Japanese Dictation Kit

Get Dictation Kit v4.5. Other kits are also available, such as the Spoken Language Model Kit and the Lecture Speech Model Kit, but only the basic kit is tested here.

wget https://osdn.net/dl/julius/dictation-kit-4.5.zip

However, this time neither v4.5 nor v4.4 could be downloaded with wget from the official link (hosted on OSDN).
The only option was to download from a mirror site. If you cannot download it either, please try a mirror as well.

cd julius
wget https://ftp.iij.ad.jp/pub/osdn.jp/julius/71011/dictation-kit-4.5.zip
unzip dictation-kit-4.5.zip

Find the device number of the USB microphone

First check the card number of the connected USB microphone. If this designation is not correct, capture will not work.

arecord -l
**** List of CAPTURE Hardware Devices ****
card 2: USB [Jabra SPEAK 510 USB], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

The Jabra SPEAK 510 I connected had card number 2, so we set it temporarily with the following command.

export ALSADEV=hw:2

The above setting is lost after a reboot, so if you want to use it permanently, add the line to ~/.profile.

Run Julius

Now that we are ready, let’s run the command. I added -input mic as an explicit option. It will also work without -nostrip, but warnings about zero-valued samples will keep appearing and fill up the terminal screen, so it is a good idea to add this option as well.

julius -C main.jconf -C am-gmm.jconf -nostrip -input mic

A long startup message is displayed after launch; when the terminal stops at <<< please speak >>>, speak into the microphone.

The following are the results of a few spoken runs.

The first utterance was “テスト” (test).
The first time it came out as “ベスト” (best), and the second time it picked up a little extra, but “テスト” (test) was recognizable.

Next, I said “停止” [teishi] (stop).

It was mistaken for “天使” [tenshi] (angel) and “変身” [henshin] (transformation). The third attempt was recognized correctly.

Ironically, the microphone picked up my unconscious mutter of “うまく認識しない” (it doesn’t recognize it well) at the end, and it recognized it correctly on the very first try.

Overall, recognition accuracy was not very high. This is probably because the microphone is a high-performance directional one: the speech itself comes through fine, but even small noises are picked up and recognized as words.

In a way, I had the impression that a cheaper microphone, or one held close to the mouth, would recognize speech more reliably.
If the system is to be used in a large, noisy place, it may be important to choose words that are hard to mistake, in addition to eliminating noise.

One thing I noticed is that the output is properly punctuated as a sentence.
Also, when an utterance is too short, it is rejected rather than recognized, and the following messages occasionally appear:

<input rejected by short input>
STAT: skip CMN parameter update since last input was invalid

A word of about three characters, such as “テスト” (test) or “停止” (stop), is too short, and this message came up several times. If anything, I felt that a somewhat longer phrase, say around 10 characters, had a better recognition rate.

Julius server mode and Python programs

One possible use in industrial applications is to temporarily stop a system when both hands are occupied or dirty.
Foot-operated switches and the like exist, but it would be convenient and quicker if the system could be paused simply by saying “停止” (stop) into the microphone.

The sample below is a Python program that processes “停止” (stop) and “再開” (resume) utterances received from the microphone.
When “停止” is recognized, it prints “→ 停止 recognized”; when “再開” is recognized, it prints “→ 再開 recognized”. Other words are simply displayed as recognized.
We assumed a situation in which processing branches on a specific word.

To process the words recognized by Julius in a Python program like this, Julius is run in server mode with the -module option.
In server mode, recognition results can be received in XML format over a TCP socket (port 10500).
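For reference, each recognition result arrives as an XML fragment shaped roughly like the following. The shape is inferred from what the client below parses; the attribute values here are placeholders, and Julius also sends a “.” line as a message terminator.

<RECOGOUT>
  <SHYPO RANK="1" SCORE="...">
    <WHYPO WORD="停止" CM="..."/>
    <WHYPO WORD="。" CM="..."/>
  </SHYPO>
</RECOGOUT>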

On the actual device, start Julius in server mode first.
Then create the sample julius_client.py below and run it with Python over an SSH connection, or in a separate terminal on the device, to receive and display the utterances.

Start Julius in server mode:

julius -C main.jconf -C am-gmm.jconf -module -input mic -nostrip

Run julius_client.py in another terminal if you are working on the actual device, or over an SSH connection.

python3 julius_client.py

You can force quit with Ctrl + C.

julius_client.py:

import socket
import xml.etree.ElementTree as ET

HOST = "127.0.0.1"
PORT = 10500

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect((HOST, PORT))
print("Connected to Julius module server")

recv_buf = ""

def parse_words(xml_text):
    try:
        root = ET.fromstring(xml_text)
        return [whypo.attrib["WORD"] for whypo in root.findall(".//SHYPO/WHYPO")
                if "WORD" in whypo.attrib and whypo.attrib["WORD"] != "[s]"]
    except Exception:
        return []

print("Julius module client running... 'Stop'/'Resume' detection sample")

try:
    while True:
        data = client.recv(4096).decode("utf-8", errors="ignore")
        if not data:
            break  # an empty read means the server closed the connection
        recv_buf += data
        while "<RECOGOUT>" in recv_buf and "</RECOGOUT>" in recv_buf:
            start = recv_buf.index("<RECOGOUT>")
            end = recv_buf.index("</RECOGOUT>") + len("</RECOGOUT>")
            xml = recv_buf[start:end]
            recv_buf = recv_buf[end:]
            words = parse_words(xml)
            if not words:
                continue

            sentence = " ".join(words)
            print(f"Recognition result: {sentence}")

            if "停止" in words:
                print("→ 'Stop' recognized")
            elif "再開" in words:
                print("→ 'Resume' recognized")

except KeyboardInterrupt:
    print("Exiting...")
finally:
    client.close()

Execution Results:

Connected to Julius module server
Julius module client running... '停止'/'再開' detection sample
Recognition result:  停止 。
→ 停止 recognized
Recognition result:  再会 。
Recognition result:  停止 。
→ 停止 recognized
Recognition result:  作業 を 再開 。
→ 再開 recognized
Recognition result:  さようなら 。
Recognition result:  こんにちは 。

“再開” [saikai] (resume) was misrecognized as its homophone “再会” [saikai] (reunion), so longer phrases such as “作業を停止” (stop work) or “作業を再開” (resume work) would be better.
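As a small adaptation of the client above (my own sketch, not part of the original test), the keyword check inside the while loop could require the longer phrase by matching against the joined words:

            # Julius returns the words separately (e.g. "作業", "を", "停止"),
            # so join them without spaces before checking for a phrase.
            joined = "".join(words)
            if "作業を停止" in joined:
                print("→ 作業停止 recognized")
            elif "作業を再開" in joined:
                print("→ 作業再開 recognized")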

Julius was used here with only the basic dictation kit, so high recognition accuracy could not be expected. However, recognition is fast, and it handles both single words and short sentences reasonably well.

Installing VOSK

Next, I will try VOSK.
It is well suited to Python and can be used immediately after installation, together with sounddevice.

First, download the Japanese speech model and place it in a suitable location.
The model version is 0.22. I chose the small model to keep things as lightweight as possible; the large model also works, but a Raspberry Pi with 8GB of memory is preferable for it.

cd
wget https://alphacephei.com/vosk/models/vosk-model-small-ja-0.22.zip
unzip vosk-model-small-ja-0.22.zip
mv vosk-model-small-ja-0.22 model

The mv command on the last line renames the directory to model.

VOSK itself is installed with pip.
On Raspberry Pi OS, pip packages must be installed into a virtual environment (the PEP 668 restriction), so we create one named vosk_env.
sounddevice, a module that works with PortAudio, is installed at the same time.

python3 -m venv vosk_env
source vosk_env/bin/activate
pip3 install vosk sounddevice

When you are finished, run deactivate in the terminal to exit the virtual environment and get back to where you were.

Index number of the microphone to be used with PortAudio

The Jabra SPEAK 510 was recognized as device 0 on card 2 (hw:2,0 or plughw:2,0).
I specified the device number based on this information, but it did not work.

As it turned out, the index number used by PortAudio from Python is different.
There is a way to get the index number and specify it in the program, but first I simply added output to the sample program to print the devices (at this point the error remained).

I added print(sd.query_devices()) to the code to check:

  0 vc4-hdmi-0: MAI PCM i2s-hifi-0 (hw:0,0), ALSA (0 in, 2 out)
  1 Jabra SPEAK 510 USB: Audio (hw:2,0), ALSA (1 in, 2 out)
  2 sysdefault, ALSA (0 in, 128 out)
  3 hdmi, ALSA (0 in, 2 out)
  4 pulse, ALSA (32 in, 32 out)
* 5 default, ALSA (32 in, 32 out)

The index number was 1; it is a bit confusing.
Because VOSK is used via sounddevice (PortAudio), the numbers do not match ALSA’s.

If you are using a USB microphone, it is best to check its index number first.
If you run into a situation where there is no error but nothing happens when you speak into the microphone, try running the following code to find out the index number used by PortAudio.

Code to find out the index number of the voice input device

Running the following displays the two sets of index numbers, one handled by ALSA and one by Python.
If you cannot capture audio with your USB microphone, use this as a reference to find the right index number.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import subprocess
import sounddevice as sd
import re

# 1. Get ALSA devices from `arecord -l`
def get_alsa_devices():
    result = subprocess.run(['arecord', '-l'], stdout=subprocess.PIPE, text=True)
    devices = []
    lines = result.stdout.splitlines()
    for line in lines:
        # Detect card line
        m_card = re.match(r'^card (\d+): (.+?)\s+\[.*\], device (\d+): (.+?)\s+\[.*\]', line)
        if m_card:
            card_num = int(m_card.group(1))
            card_name = m_card.group(2).strip()
            device_num = int(m_card.group(3))
            device_name = m_card.group(4).strip()
            devices.append({
                'card': card_num,
                'card_name': card_name,
                'device': device_num,
                'device_name': device_name
            })
    return devices

# 2. Get Python / PortAudio devices
def get_sounddevice_devices():
    devices = []
    for i, dev in enumerate(sd.query_devices()):
        if dev['max_input_channels'] > 0:
            devices.append({
                'index': i,
                'name': dev['name'],
                'max_in': dev['max_input_channels'],
                'max_out': dev['max_output_channels']
            })
    return devices

# 3. Display both
print(" ALSA (arecord -l)")
for d in get_alsa_devices():
    print(f"card {d['card']}, device {d['device']}: {d['card_name']} - {d['device_name']}")

print()
print(" Python / sounddevice")
for d in get_sounddevice_devices():
    print(f"index {d['index']}: {d['name']} ({d['max_in']} in, {d['max_out']} out)")

Execution Results:

 ALSA (arecord -l)
card 2, device 0: USB - USB Audio

 Python / sounddevice
index 1: Jabra SPEAK 510 USB: Audio (hw:2,0) (1 in, 2 out)
index 4: pulse (32 in, 32 out)
index 5: default (32 in, 32 out)

Look for your USB microphone’s name under “Python / sounddevice” in the output.
In the sample programs that follow, this index number (1) is specified.
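If you would rather not hardcode the number, a small helper can look the index up by name at startup. This is my own sketch, not part of the original programs; find_input_device and the "Jabra" name fragment are assumptions.

import sounddevice as sd

def find_input_device(name_fragment="Jabra"):
    # Return the PortAudio index of the first input-capable device
    # whose name contains name_fragment, or None if not found.
    for i, dev in enumerate(sd.query_devices()):
        if dev["max_input_channels"] > 0 and name_fragment in dev["name"]:
            return i
    return None

device_index = find_input_device()  # 1 for the Jabra SPEAK 510 in this setup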

Python program to branch processing

Similar to the Julius program described earlier, we will try a program in VOSK that uses the words “停止” and “再開” to branch the process.
“停止” means “stop” and is pronounced teishi.
再開 means “resume” and is pronounced saikai.
In the case of VOSK, it was relatively easy to work with Python.

In Japanese, “停止” [teishi] can be written as 停止 (kanji), ていし (hiragana), or テイシ (katakana), so the program waits for all of “停止”, “ていし”, and “テイシ”.
Note that the device index is specified as 1 and the input is monaural (1 channel).

You can force quit with Ctrl + C.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sounddevice as sd
import queue
import json
from vosk import Model, KaldiRecognizer

MODEL_PATH = "model"
SAMPLE_RATE = 16000
q = queue.Queue()

def callback(indata, frames, time, status):
    if status:
        print(status, flush=True)
    q.put(bytes(indata))

def main():
    print("Loading VOSK model...")
    model = Model(MODEL_PATH)
    rec = KaldiRecognizer(model, SAMPLE_RATE)

    machine_running = True

    # USB microphone: PortAudio device index 1, mono
    with sd.RawInputStream(
            device=1,
            samplerate=SAMPLE_RATE,
            blocksize=8000,
            dtype="int16",
            channels=1,
            callback=callback):
        print("Speech recognition started. Please say '停止' (stop) or '再開' (resume).")
        while True:
            data = q.get()
            if rec.AcceptWaveform(data):
                result = json.loads(rec.Result())
                text = result.get("text", "")
                if text:
                    print(f"Recognition result: {text}")
                    if any(word in text for word in ["停止", "ていし", "テイシ"]) and machine_running:
                        print(">>> 停止コマンド検出!")
                        machine_running = False
                    elif any(word in text for word in ["再開", "さいかい", "サイカイ"]) and not machine_running:
                        print(">>> Resume command detected!")
                        machine_running = True

if __name__ == "__main__":
    main()

Looking at the results, there are no errors, but in the recognition results “停止” [teishi] keeps coming out as “天使” [tenshi], which means “angel.”

Execution Results:

Recognition result: 天使
Recognition result: 天使
Recognition result: 開始
Recognition result: いや 天使
Recognition result: 天使
Recognition result: いい
Recognition result: て いい し
Recognition result: 天使
Recognition result: いい し

This can happen because the model is small, and it also depends on the type of microphone and the environment.

High-performance microphones in particular are directional, so recognition accuracy can vary with the location of the sound source.
Accuracy also drops in noisy environments.

One way to improve accuracy in VOSK is to restrict the recognition vocabulary.

VOSK lets you specify the words you want to wait for: rec = KaldiRecognizer(model, SAMPLE_RATE, '["停止", "再開"]')
This way, words other than “停止” [teishi] and “再開” [saikai] are ignored, which can greatly improve accuracy.

Modified Python program

In the end, the following code worked as expected.
Try saying “再開” [saikai] after “停止” [teishi] several times, and check whether each is recognized on the first attempt and how quickly processing stops.

停止 [teishi] or 再開 [saikai] is displayed after about 1 second with the small model (about 2 seconds with the large model).
For cases that do not demand very high accuracy, the system really can be “停止” (stopped) just by saying so.

You can force quit with Ctrl + C.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sounddevice as sd
import queue
import json
from vosk import Model, KaldiRecognizer

# Settings
MODEL_PATH = "model"     # Place vosk-model-small-ja-0.22 in model/
SAMPLE_RATE = 16000       # Small model recommends 16kHz
q = queue.Queue()

# Command handling functions
def stop_process():
    print(">>> Executing machine STOP process")

def resume_process():
    print(">>> Executing machine RESUME process")

# Audio data callback
def callback(indata, frames, time, status):
    if status:
        print(status, flush=True)
    # With RawInputStream, indata is a _cffi_backend.buffer
    # Convert to bytes before passing to VOSK
    q.put(bytes(indata))

def main():
    print("Loading VOSK model...")
    model = Model(MODEL_PATH)

    # Limit recognition vocabulary to '停止' and '再開'
    rec = KaldiRecognizer(model, SAMPLE_RATE, '["停止","再開"]')

    machine_running = True

    # USB microphone: PortAudio device index 1, mono 1ch
    with sd.RawInputStream(
            device=1,           # PortAudio device index of the USB mic
            samplerate=SAMPLE_RATE,
            blocksize=8000,
            dtype="int16",
            channels=1,         # Mono, 1 channel
            callback=callback):
        print("Speech recognition started. Please say '停止' (stop) or '再開' (resume).")

        while True:
            data = q.get()
            if rec.AcceptWaveform(data):
                result = json.loads(rec.Result())
                text = result.get("text", "")
                if text:
                    print(f"認識結果: {text}")
                    # 判定
                    if "停止" in text and machine_running:
                        stop_process()
                        machine_running = False
                    elif "再開" in text and not machine_running:
                        resume_process()
                        machine_running = True

if __name__ == "__main__":
    main()

Execution Results:

Speech recognition started. Please say '停止' (stop) or '再開' (resume).
Recognition result: 停止
>>> Executing machine STOP process
Recognition result: 停止
Recognition result: 再開
>>> Executing machine RESUME process

As the recognition results show, both 停止 [teishi] and 再開 [saikai] are recognized.
The second 停止 [teishi] correctly results in no action: a stop command while already stopped is recognized but ignored.

Saying 再開 [saikai] then successfully takes the branch that resumes processing. Although the response time is somewhat slow, it runs smoothly with no errors.
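If the roughly one-second response time matters, VOSK also exposes interim hypotheses through KaldiRecognizer.PartialResult(), which returns JSON with a "partial" field before the utterance is finalized. As a sketch, the while loop of the program above could be rewritten as follows; the machine_running flag already prevents repeated partial results from firing the same command twice.

        while True:
            data = q.get()
            if rec.AcceptWaveform(data):
                # Final result for a finished utterance
                text = json.loads(rec.Result()).get("text", "")
            else:
                # Interim hypothesis, available before the utterance ends
                text = json.loads(rec.PartialResult()).get("partial", "")
            if "停止" in text and machine_running:
                stop_process()
                machine_running = False
            elif "再開" in text and not machine_running:
                resume_process()
                machine_running = True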

This time the commands were short words, “停止” and “再開”. That “停止” [teishi] is sometimes mistaken for “天使” [tenshi] (angel) is simply because the Japanese recognition model used is small.

If recognition is unreliable, it is also a good idea to choose a word that is not easily confused with others, such as “作業停止” (work stoppage), to make it unique.
VOSK can also handle custom word lists. (Julius also supports dictionaries, although in a different way.)

It is possible to load a word list from an external file, but it is most convenient to specify the list directly in the sample code, as follows.

# Restrict recognition vocabulary with a grammar list (phrase specification)
grammar = '["停止", "再開", "機械を停止して", "再開してください"]'

rec = KaldiRecognizer(model, SAMPLE_RATE, grammar)
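As a sketch of the external-file approach mentioned above (the file name words.txt is an assumption; model and SAMPLE_RATE are as defined in the program earlier), the grammar string can be built with json.dumps:

import json

# Load the wait-for words/phrases from words.txt, one per line
with open("words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

# ensure_ascii=False keeps the Japanese characters readable in the JSON string
grammar = json.dumps(words, ensure_ascii=False)
rec = KaldiRecognizer(model, SAMPLE_RATE, grammar)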

This time, stop and resume are just indicated by print statements, but you can add real processing and see how it runs; a sketch follows.
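For example, stop_process() and resume_process() could drive real hardware. A minimal sketch, assuming the gpiozero library and a relay on GPIO 17 (both assumptions; nothing like this was wired up in this test):

from gpiozero import OutputDevice

# Hypothetical relay driving the machine; energized at startup to match
# the program's initial machine_running = True state
machine_relay = OutputDevice(17, initial_value=True)

def stop_process():
    machine_relay.off()
    print(">>> Executing machine STOP process")

def resume_process():
    machine_relay.on()
    print(">>> Executing machine RESUME process")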

Practical speech recognition

If program flow can be branched by speech recognition, it can serve not only industrial applications but also as a substitute for switches and buttons when complicated operation is undesirable.
All that is needed is an added USB-connected microphone, and controlling a system with a single word or a short phrase is practical.

Julius has a useful server mode. Programming for it is a bit more involved, but it seems the best way to link it with other components, and it is very flexible to incorporate.
VOSK, on the other hand, is easy to handle from Python, which suits the Raspberry Pi, where a Python environment is available from the start.

With both, the Raspberry Pi’s processing specs felt sufficient.
If a Raspberry Pi can carry this part of the load, the advantages are power savings and a smaller footprint.

Julius: https://github.com/julius-speech/julius
VOSK: https://alphacephei.com/vosk/models


Article contributed by Raspida

Raspida runs raspida.com, a Raspberry Pi information site that even non-engineers can enjoy. He also contributes technical blog articles on industrial uses of the Raspberry Pi to the PiLink site.