An overview of F/OSS Audio to Text Tools. Speech rAIter

May 12, 2025

An overview of the existing open-source alternatives for audio-to-text conversion (also called speech to text).

Star History Chart

S2T

The process of converting spoken words into written text is called transcription.

The output of this process is also often referred to as a transcript.

ℹ️
Make sure to have the right PyTorch installed: https://pytorch.org/get-started/locally/

OpenAI Whisper

The open-source Whisper models can run fully locally; an OpenAI API key is only needed if you use the hosted API instead.

Either way, it's worth giving it a try.

MIT | Robust Speech Recognition via Large-Scale Weak Supervision
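Whisper can also be used directly from Python. A minimal sketch, assuming the `openai-whisper` package is installed; `transcribe_file` and `segments_to_text` are hypothetical helpers, not part of Whisper's API, and `audio.mp3` is a placeholder file:

```python
def transcribe_file(path, model_name="base"):
    """Transcribe an audio file locally with openai-whisper."""
    import whisper  # pip install openai-whisper
    model = whisper.load_model(model_name)  # downloads weights on first use
    return model.transcribe(path)           # dict with "text" and "segments"

def segments_to_text(segments):
    """Join Whisper's per-segment texts into one clean transcript string."""
    return " ".join(seg["text"].strip() for seg in segments)
```

For example, `segments_to_text(transcribe_file("audio.mp3")["segments"])` returns the whole transcript as a single string.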

Ecoute

Ecoute is a live transcription tool that provides real-time transcripts for both the user’s microphone input (You) and the user’s speakers output (Speaker) in a textbox.

git clone https://github.com/SevaSk/ecoute
cd ecoute
python3 -m venv ecoute_venv #create the venv

#ecoute_venv\Scripts\activate #activate venv (windows)
source ecoute_venv/bin/activate #activate venv (linux)

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
pip install whisper==1.10.0
#export OPENAI_API_KEY="sk-somekey" #linux
#$Env:OPENAI_API_KEY = "sk-somekey" #PS
#set OPENAI_API_KEY=sk-somekey #cmd
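Once the key is exported, a quick way to confirm Python can actually see it; `api_key_present` is just an illustrative sketch using the standard library:

```python
import os

def api_key_present(env=None):
    """Return True if an OpenAI-style key is visible in the environment."""
    env = os.environ if env is None else env
    return env.get("OPENAI_API_KEY", "").startswith("sk-")
```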

ecoute requirements

⚠️
The project only ships wheels for Windows; if your system is not Windows, the install fails with this error.
Trying to bundle Ecoute 📌
python main.py
# python main.py --api
# Use the specified Python base image
FROM python:3.10-slim

# Set the working directory in the container
WORKDIR /app

# Install necessary packages (ffmpeg is required by Whisper)
RUN apt-get update && apt-get install -y \
    git \
    build-essential \
    ffmpeg

# Clone the repository
RUN git clone https://github.com/SevaSk/ecoute

WORKDIR /app/ecoute

RUN pip install -r requirements.txt

# Keep the container running
#CMD ["tail", "-f", "/dev/null"]

Ecoute also generates a suggested response using OpenAI's GPT-3.5 for the user to say, based on the live transcription of the conversation.
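That suggestion step follows a familiar chat-completion pattern. The sketch below shows the general idea, not Ecoute's actual implementation; the prompt wording and helper names are assumptions:

```python
def build_suggestion_messages(transcript):
    """Turn the live transcript into a chat-completion message list."""
    return [
        {"role": "system",
         "content": "Given this conversation transcript, suggest a short "
                    "reply the user could say next."},
        {"role": "user", "content": transcript},
    ]

def suggest_response(client, transcript, model="gpt-3.5-turbo"):
    """client is an openai.OpenAI() instance; needs OPENAI_API_KEY set."""
    resp = client.chat.completions.create(
        model=model,
        messages=build_suggestion_messages(transcript),
    )
    return resp.choices[0].message.content
```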

docker build -t ecoute .
#docker tag ecoute docker.io/fossengineer/ecoute:latest
#docker push docker.io/fossengineer/ecoute:latest
#version: '3'

services:
  ai-ecoute:
    image: fossengineer/ecoute  # Replace with your image name and tag
    container_name: ecoute
    ports:
      - "8001:8001"
    volumes:
      - ai-ecoute:/app
    command: /bin/bash -c "python main.py && tail -f /dev/null" #make run

volumes:
  ai-ecoute:

oTranscribe

MIT | A free & open tool for transcribing audio interviews

WriteOutAI

MIT | Transcribe and translate your audio files - for free

WHISHPER

AGPL | Transcribe any audio to text, translate and edit subtitles 100% locally with a web UI. Powered by Whisper models!


Conclusions

Now we have seen the differences between TTS and S2T (Transcription) frameworks out there!

Time to do cool things with them.

Like…putting together a voice assistant with Streamlit:

For TTS, OpenAI has lately made interesting upgrades: https://platform.openai.com/docs/models/gpt-4o-mini-tts

  • Voice Synthesis: TTS systems use various techniques to create synthetic voices. Early systems used concatenative synthesis (piecing together recorded human speech), while modern systems often use more advanced techniques like statistical parametric synthesis and neural network-based synthesis, which can produce more natural-sounding speech.  
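With the OpenAI Python SDK, a TTS call along those lines could look like the sketch below. The voice name and output path are placeholders, and `tts_request_kwargs` is a hypothetical helper, not part of the SDK:

```python
def tts_request_kwargs(text, voice="alloy", model="gpt-4o-mini-tts"):
    """Assemble the arguments for OpenAI's audio.speech endpoint."""
    return {"model": model, "voice": voice, "input": text}

def synthesize_to_file(client, text, out_path="speech.mp3"):
    """client is an openai.OpenAI() instance; writes the audio bytes to disk."""
    resp = client.audio.speech.create(**tts_request_kwargs(text))
    with open(out_path, "wb") as f:
        f.write(resp.content)
```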

Streamlit Audio

With the st.audio_input component, a lot of cool stuff can be done: https://docs.streamlit.io/develop/api-reference/widgets/st.audio_input

Thanks to Benji's YouTube video: https://www.youtube.com/watch?v=UnjaSkrfWOs

I have added a sample working script at the MultiChat project, here: https://github.com/JAlcocerT/Streamlit-MultiChat/blob/main/Z_Tests/OpenAI/Audio/audio-input.py
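The core of such a script can be sketched like this; `looks_like_wav` is an illustrative helper, and the OpenAI call assumes `OPENAI_API_KEY` is set. Run it with `streamlit run app.py`:

```python
def looks_like_wav(data: bytes) -> bool:
    """st.audio_input records WAV audio; quick RIFF-header sanity check."""
    return data[:4] == b"RIFF" and data[8:12] == b"WAVE"

def main():
    # Imports live inside main() so the helper above stays importable
    # without streamlit/openai installed.
    import streamlit as st
    from openai import OpenAI

    audio = st.audio_input("Record a voice message")  # UploadedFile or None
    if audio is not None and looks_like_wav(audio.getvalue()):
        client = OpenAI()
        result = client.audio.transcriptions.create(model="whisper-1",
                                                    file=audio)
        st.write(result.text)
```

In a real app you would call `main()` at the bottom of the file.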

See also another way to do TTS with OpenAI: https://github.com/JAlcocerT/Streamlit-MultiChat/blob/main/Z_Tests/OpenAI/Audio/openai-tts.py

More Audio Generation

  1. https://github.com/SevaSk/ecoute
  2. https://pypi.org/project/PyAudioWPatch/#description

Trying Ecoute on Windows

python3 -m venv ecoutevenv
source ecoutevenv/bin/activate

apt install ffmpeg

git clone https://github.com/SevaSk/ecoute ./ecoute_repo
cd ecoute_repo
python -m pip install -r requirements.txt

chmod +x cygwin_cibuildwheel_build.sh

./cygwin_cibuildwheel_build.sh

#deactivate
  1. LocalAI - with voice cloning! The reason why I don't like to put my voice on the internet :)

Runs gguf, transformers, diffusers and many more model architectures. It can generate text, audio, video and images, and also offers voice-cloning capabilities.

  2. Willow on GitHub. See HeyWillow

Apache v2 | Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative

  3. Zonos: an ElevenLabs alternative

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers.

I found out about it at https://noted.lol/zonos/