An overview of F/OSS Audio to Text Tools. Speech rAIter

May 12, 2025

An overview of the existing open source alternatives for audio-to-text conversion (also called Speech to Text, S2T).

But first: how to create a PoC to help people get better at public speaking.

The Speech Rater

How about using Streamlit to input and output audio?

Well, plugging LLMs into that is fairly easy:

Speech Rater PoC Streamlit

    • speechraiter.py
    • requirements.txt
    • Dockerfile
    • README.md
      • Auth_functions.py
      • Streamlit_OpenAI.py
      • readme.md
      • openai_t2t.py
      • openai-tts.py
      • audio-input.py
      • audio-input-save.py
    It was all about getting the Streamlit audio part right.

    graph TD
        A[User records Audio] --> B(Streamlit receives Audio);
        B --> C{OpenAI Transcription};
        C --> D[Transcription Inputs to LLM];
        D --> E["Text-to-Speech (T2S)"];
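    A minimal sketch of that flow, assuming the current OpenAI Python SDK, an OPENAI_API_KEY in the environment, and illustrative model names (whisper-1, gpt-3.5-turbo, tts-1) rather than whatever the PoC actually pins:

    import streamlit as st
    from openai import OpenAI
    
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    
    # A -> B: the user records audio in the browser
    audio = st.audio_input("Record your speech")
    if audio:
        # B -> C: send the recording to OpenAI for transcription
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio
        ).text
        # C -> D: feed the transcript to an LLM for feedback
        feedback = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Rate this speech and give concise feedback."},
                {"role": "user", "content": transcript},
            ],
        ).choices[0].message.content
        st.write(feedback)
        # D -> E: read the feedback back with text-to-speech
        speech = client.audio.speech.create(model="tts-1", voice="alloy", input=feedback)
        st.audio(speech.content, format="audio/mp3")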

    S2T

    The process of converting spoken words into written text is called transcription.

    The output of this process is also often referred to as a transcript.

    ℹ️
    Make sure to have the right PyTorch installed: https://pytorch.org/get-started/locally/

    OpenAI Whisper

    Using it through OpenAI's API requires an API key, but the MIT-licensed model can also run fully locally.

    Either way, it's worth giving it a try.

    MIT | Robust Speech Recognition via Large-Scale Weak Supervision
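    For the local route, a minimal sketch (assuming the openai-whisper package and ffmpeg are installed; speech.mp3 is a placeholder file):

    import whisper  # pip install openai-whisper
    
    model = whisper.load_model("base")       # downloads the model weights on first use
    result = model.transcribe("speech.mp3")  # ffmpeg must be on the PATH
    print(result["text"])                    # the transcript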

    Ecoute

    Ecoute is a live transcription tool that provides real-time transcripts for both the user’s microphone input (You) and the user’s speakers output (Speaker) in a textbox.

    git clone https://github.com/SevaSk/ecoute
    cd ecoute
    python3 -m venv ecoute_venv #create the venv
    
    #ecoute_venv\Scripts\activate #activate the venv (windows)
    source ecoute_venv/bin/activate #activate the venv (linux)
    
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
    pip install -r requirements.txt
    pip install whisper==1.10.0
    #export OPENAI_API_KEY="sk-somekey" #linux
    #$Env:OPENAI_API_KEY = "sk-somekey" #PS
    #set OPENAI_API_KEY=sk-somekey #cmd

    ecoute requirements

    ⚠️
    The project only ships wheels for Windows; if your system is not Windows, that is the source of the error.
    Trying to bundle Ecoute 📌
    python main.py
    # python main.py --api
    
    # Use the specified Python base image
    FROM python:3.10-slim
    
    # Set the working directory in the container
    WORKDIR /app
    
    # Install necessary packages (ffmpeg is needed for the audio handling)
    RUN apt-get update && apt-get install -y \
        git \
        build-essential \
        ffmpeg
    
    # Clone the repository
    RUN git clone https://github.com/SevaSk/ecoute
    
    WORKDIR /app/ecoute
    
    # Copy the project files into the container
    COPY . /app
    
    RUN pip install -r requirements.txt
    
    # Keep the container running
    #CMD ["tail", "-f", "/dev/null"]

    Ecoute also generates a suggested response using OpenAI's GPT-3.5 for the user to say, based on the live transcription of the conversation.
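    A rough sketch of how such a suggested response can be produced with the OpenAI SDK (the prompt wording here is mine, not Ecoute's):

    from openai import OpenAI
    
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    
    def suggest_response(conversation_transcript: str) -> str:
        """Ask GPT-3.5 for a short reply the user could say next."""
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Suggest a short, natural reply the user could say next."},
                {"role": "user", "content": conversation_transcript},
            ],
        )
        return completion.choices[0].message.content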

    docker build -t ecoute .
    #docker tag ecoute docker.io/fossengineer/ecoute:latest
    #docker push docker.io/fossengineer/ecoute:latest
    #version: '3'
    
    services:
      ai-ecoute:
        image: fossengineer/ecoute  # Replace with your image name and tag
        container_name: ecoute
        ports:
          - "8001:8001"
        volumes:
          - ai-ecoute:/app
        command: /bin/bash -c "python main.py && tail -f /dev/null" #make run
        
    volumes:
      ai-ecoute:
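    With the image built and the volume name matching, bring the stack up with:

    docker compose up -d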

    oTranscribe

    MIT | A free & open tool for transcribing audio interviews

    WriteOutAI

    MIT | Transcribe and translate your audio files - for free

    WHISHPER

    AGPL | Transcribe any audio to text, translate and edit subtitles 100% locally with a web UI. Powered by Whisper models!



    Conclusions

    Now we have seen the differences between TTS and S2T (Transcription) frameworks out there!

    Time to do cool things with them.

    Like…putting together a voice assistant with Streamlit:

    SpeechRater

    For TTS, OpenAI has lately made interesting upgrades with 4o-mini.

    • Voice Synthesis: TTS systems use various techniques to create synthetic voices. Early systems used concatenative synthesis (piecing together recorded human speech), while modern systems often use more advanced techniques like statistical parametric synthesis and neural network-based synthesis, which can produce more natural-sounding speech.  
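    As a concrete example, a minimal neural TTS call with the OpenAI SDK (the voice and output path are arbitrary choices):

    from openai import OpenAI
    
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    
    # Generate spoken audio from text and save it as an MP3 file
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input="Text to speech has come a long way since concatenative synthesis.",
    )
    with open("speech.mp3", "wb") as f:
        f.write(speech.content)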

    Streamlit Audio

    With the st.audio_input component, a lot of cool stuff can be done: https://docs.streamlit.io/develop/api-reference/widgets/st.audio_input

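    For instance, a minimal recorder sketch (the widget label is arbitrary):

    import streamlit as st
    
    # st.audio_input returns the recording as an UploadedFile (WAV), or None
    recording = st.audio_input("Record a voice message")
    if recording:
        st.audio(recording)  # play the recording back in the app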

    Thanks to Benji's YouTube video: https://www.youtube.com/watch?v=UnjaSkrfWOs

    I have added a sample working script at the MultiChat project, here: https://github.com/JAlcocerT/Streamlit-MultiChat/blob/main/Z_Tests/OpenAI/Audio/audio-input.py

    See also another way to do TTS with OpenAI: https://github.com/JAlcocerT/Streamlit-MultiChat/blob/main/Z_Tests/OpenAI/Audio/openai-tts.py

    More Audio Generation

    1. https://github.com/SevaSk/ecoute
    2. https://pypi.org/project/PyAudioWPatch/#description

    Try Ecoute in Windows

    python3 -m venv ecoutevenv
    source ecoutevenv/bin/activate
    
    apt install ffmpeg
    
    git clone https://github.com/SevaSk/ecoute ./ecoute_repo
    cd ecoute_repo
    python -m pip install -r requirements.txt
    
    chmod +x cygwin_cibuildwheel_build.sh
    
    ./cygwin_cibuildwheel_build.sh
    
    #deactivate
    1. LocalAI - with voice cloning! The reason why I don't like putting my voice on the internet :)

    It runs gguf, transformers, diffusers and many more model architectures, and can generate text, audio, video and images. It also comes with voice cloning capabilities.

    2. Willow on GitHub. See HeyWillow.

    Apache v2 | Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative

    3. Zonos: an ElevenLabs alternative

    Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.

    I found out about it at https://noted.lol/zonos/