An overview of F/OSS Audio to Text Tools. Speech rAIter

May 12, 2025

An overview of the existing open source alternatives for audio-to-text conversion (also called Speech to Text, S2T).

But first: how to create a PoC to help people get better at public speaking.

The Speech Rater

How about using Streamlit to input and output audio?

Well, plugging LLMs into that is fairly easy:

Speech Rater PoC Streamlit

    • speechraiter.py
    • requirements.txt
    • Dockerfile
    • README.md
      • Auth_functions.py
      • Streamlit_OpenAI.py
      • readme.md
      • openai_t2t.py
      • openai-tts.py
      • audio-input.py
      • audio-input-save.py
    It was all about getting the Streamlit audio part right.

    graph TD
        A[User records Audio] --> B(Streamlit receives Audio);
        B --> C{OpenAI Transcription};
        C --> D[Transcription Inputs to LLM];
        D --> E["Text-to-Speech (T2S)"];
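    A minimal sketch of that flow, assuming the current OpenAI Python SDK, an OPENAI_API_KEY in the environment, and illustrative model names (whisper-1, gpt-3.5-turbo, tts-1) rather than whatever the PoC actually pins:

    import streamlit as st
    from openai import OpenAI
    
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    
    # A -> B: the user records audio in the browser
    audio = st.audio_input("Record your speech")
    if audio:
        # B -> C: send the recording to OpenAI for transcription
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio
        ).text
        # C -> D: feed the transcript to an LLM for feedback
        feedback = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Rate this speech and give concise feedback."},
                {"role": "user", "content": transcript},
            ],
        ).choices[0].message.content
        st.write(feedback)
        # D -> E: read the feedback back with text-to-speech
        speech = client.audio.speech.create(model="tts-1", voice="alloy", input=feedback)
        st.audio(speech.content, format="audio/mp3")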

    S2T

    The process of converting spoken words into written text is called transcription.

    The output of this process is also often referred to as a transcript.

    ℹ️
    Make sure to have the right PyTorch installed: https://pytorch.org/get-started/locally/

    OpenAI Whisper

    Using it through OpenAI's API requires an API key, but the MIT-licensed model can also run fully locally.

    Either way, it's worth giving it a try.

    MIT | Robust Speech Recognition via Large-Scale Weak Supervision
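    For the local route, a minimal sketch (assuming the openai-whisper package and ffmpeg are installed; speech.mp3 is a placeholder file):

    import whisper  # pip install openai-whisper
    
    model = whisper.load_model("base")       # downloads the model weights on first use
    result = model.transcribe("speech.mp3")  # ffmpeg must be on the PATH
    print(result["text"])                    # the transcript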

    Ecoute

    Ecoute is a live transcription tool that provides real-time transcripts for both the user’s microphone input (You) and the user’s speakers output (Speaker) in a textbox.

    git clone https://github.com/SevaSk/ecoute
    cd ecoute
    python3 -m venv ecoute_venv #create the venv
    
    #ecoute_venv\Scripts\activate #activate the venv (windows)
    source ecoute_venv/bin/activate #activate the venv (linux)
    
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
    pip install -r requirements.txt
    pip install whisper==1.10.0
    #export OPENAI_API_KEY="sk-somekey" #linux
    #$Env:OPENAI_API_KEY = "sk-somekey" #PS
    #set OPENAI_API_KEY=sk-somekey #cmd

    ecoute requirements

    ⚠️
    The project only ships wheels for Windows; if your system is not Windows, that is the source of the error.
    Trying to bundle Ecoute 📌
    python main.py
    # python main.py --api
    
    # Use the specified Python base image
    FROM python:3.10-slim
    
    # Set the working directory in the container
    WORKDIR /app
    
    # Install necessary packages (ffmpeg is needed for the audio handling)
    RUN apt-get update && apt-get install -y \
        git \
        build-essential \
        ffmpeg
    
    # Clone the repository
    RUN git clone https://github.com/SevaSk/ecoute
    
    WORKDIR /app/ecoute
    
    # Copy the project files into the container
    COPY . /app
    
    RUN pip install -r requirements.txt
    
    # Keep the container running
    #CMD ["tail", "-f", "/dev/null"]

    Ecoute also generates a suggested response using OpenAI's GPT-3.5 for the user to say, based on the live transcription of the conversation.
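    A rough sketch of how such a suggested response can be produced with the OpenAI SDK (the prompt wording here is mine, not Ecoute's):

    from openai import OpenAI
    
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    
    def suggest_response(conversation_transcript: str) -> str:
        """Ask GPT-3.5 for a short reply the user could say next."""
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Suggest a short, natural reply the user could say next."},
                {"role": "user", "content": conversation_transcript},
            ],
        )
        return completion.choices[0].message.content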

    docker build -t ecoute .
    #docker tag ecoute docker.io/fossengineer/ecoute:latest
    #docker push docker.io/fossengineer/ecoute:latest
    #version: '3'
    
    services:
      ai-ecoute:
        image: fossengineer/ecoute  # Replace with your image name and tag
        container_name: ecoute
        ports:
          - "8001:8001"
        volumes:
          - ai-ecoute:/app
        command: /bin/bash -c "python main.py && tail -f /dev/null" #make run
        
    volumes:
      ai-ecoute:
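    With the image built and the volume name matching, bring the stack up with:

    docker compose up -d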

    oTranscribe

    MIT | A free & open tool for transcribing audio interviews

    WriteOutAI

    MIT | Transcribe and translate your audio files - for free

    WHISHPER

    AGPL | Transcribe any audio to text, translate and edit subtitles 100% locally with a web UI. Powered by Whisper models!



    Conclusions

    Now we have seen the differences between TTS and S2T (Transcription) frameworks out there!

    Time to do cool things with them.

    Like…putting together a voice assistant with Streamlit:

    SpeechRater

    For TTS, OpenAI has lately made interesting upgrades with 4o-mini.

    • Voice Synthesis: TTS systems use various techniques to create synthetic voices. Early systems used concatenative synthesis (piecing together recorded human speech), while modern systems often use more advanced techniques like statistical parametric synthesis and neural network-based synthesis, which can produce more natural-sounding speech.  
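    As a concrete example, a minimal neural TTS call with the OpenAI SDK (the voice and output path are arbitrary choices):

    from openai import OpenAI
    
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    
    # Generate spoken audio from text and save it as an MP3 file
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input="Text to speech has come a long way since concatenative synthesis.",
    )
    with open("speech.mp3", "wb") as f:
        f.write(speech.content)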

    Streamlit Audio

    With the st.audio_input component, a lot of cool stuff can be done: https://docs.streamlit.io/develop/api-reference/widgets/st.audio_input

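    For instance, a minimal recorder sketch (the widget label is arbitrary):

    import streamlit as st
    
    # st.audio_input returns the recording as an UploadedFile (WAV), or None
    recording = st.audio_input("Record a voice message")
    if recording:
        st.audio(recording)  # play the recording back in the app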

    Thanks to Benji's YouTube video: https://www.youtube.com/watch?v=UnjaSkrfWOs

    I have added a sample working script at the MultiChat project, here: https://github.com/JAlcocerT/Streamlit-MultiChat/blob/main/Z_Tests/OpenAI/Audio/audio-input.py

    See also another way to do TTS with OpenAI: https://github.com/JAlcocerT/Streamlit-MultiChat/blob/main/Z_Tests/OpenAI/Audio/openai-tts.py

    More Audio Generation

    1. https://github.com/SevaSk/ecoute
    2. https://pypi.org/project/PyAudioWPatch/#description

    Try Ecoute in Windows

    python3 -m venv ecoutevenv
    source ecoutevenv/bin/activate
    
    apt install ffmpeg
    
    git clone https://github.com/SevaSk/ecoute ./ecoute_repo
    cd ecoute_repo
    python -m pip install -r requirements.txt
    
    chmod +x cygwin_cibuildwheel_build.sh
    
    ./cygwin_cibuildwheel_build.sh
    
    #deactivate
    1. LocalAI - with voice cloning! The reason why I don't like putting my voice on the internet :)

    It runs gguf, transformers, diffusers and many more model architectures, and can generate text, audio, video and images. It also comes with voice cloning capabilities.

    2. Willow on GitHub. See HeyWillow.

    Apache v2 | Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative

    3. Zonos: an ElevenLabs alternative

    Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.

    I found out about it at https://noted.lol/zonos/