Audio with AI: TTS and Voice Cloning
TTS
Let's look at some Text-to-Speech AI tools, including Google and OpenAI solutions.
LocalAI TTS
MIT | 🤖 The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed, P2P inference
- But we came for the text to audio capabilities: https://localai.io/features/text-to-audio/
- And a very interesting Swagger API
The UI will be at: http://192.168.0.12:8081/ (replace with your host's IP)
And the Swagger API at: http://192.168.0.12:8081/swagger/
Coqui TTS
Local voice chatbot for engaging conversations, powered by Ollama, Hugging Face Transformers, and Coqui TTS Toolkit
MPL | 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Eager to spin up a local Coqui Text-to-Speech server?
docker run -d \
  --name coquitts \
  -p 5002:5002 \
  --entrypoint python3 \
  ghcr.io/coqui-ai/tts-cpu \
  TTS/server/server.py \
  --model_name tts_models/en/vctk/vits
It starts with the en/vctk/vits model, but you can change it later on.
The web UI will be at port 5002, and it works with languages other than English as well!
Deploy with the related docker-compose for CoquiTTS.
Deploy CoquiTTS with Docker | CLI Details 📌
docker exec -it coquitts /bin/bash
docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
python3 TTS/server/server.py --list_models # To get the list of available models
python3 TTS/server/server.py --model_name tts_models/en/vctk/vits # To start a server
#python3 TTS/server/server.py --model_name tts_models/es/mai/tacotron2-DDC
services:
  tts-cpu:
    image: ghcr.io/coqui-ai/tts-cpu
    container_name: coquitts
    ports:
      - "5002:5002"
    entrypoint: /bin/bash
    tty: true
    stdin_open: true
    # Optional: mount a volume to persist data or access local files
    # volumes:
    #   - ./local_data:/data
server.py is a Flask app, by the way :)
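Since server.py is a Flask app, you can also skip the web UI and call its /api/tts endpoint directly. A minimal Python sketch, assuming the container above is listening on localhost:5002; the speaker ID p225 is just one example speaker from the VCTK set:

```python
from urllib.parse import urlencode

def tts_url(text: str, speaker_id: str = "p225",
            base: str = "http://localhost:5002") -> str:
    """Build a GET request URL for the demo server's /api/tts endpoint."""
    return f"{base}/api/tts?" + urlencode({"text": text, "speaker_id": speaker_id})

# With the container from above running, fetch the WAV (uncomment to try):
# from urllib.request import urlopen
# with urlopen(tts_url("Hello from Coqui TTS!")) as resp:
#     open("hello.wav", "wb").write(resp.read())
```

The endpoint answers with a WAV body, so the same URL also works from curl or a browser tab.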
RT Voice Cloning
https://github.com/CorentinJ/Real-Time-Voice-Cloning
BARK
MIT | The code for the bark-voicecloning model. Training and inference.
Important: get the right PyTorch version from https://pytorch.org/get-started/locally/
OpenVoice
F/OSS Voice Clone
- https://github.com/CorentinJ/Real-Time-Voice-Cloning
- https://pythonawesome.com/clone-a-voice-in-5-seconds-to-generate-arbitrary-speech-in-real-time/
- Open Voice + Google Colab
- Also locally
XTTS2 Local Voice Cloning
A UI for Coqui TTS
See: /guide-xtts2-ui
https://www.youtube.com/watch?v=0vGeWA8CSyk
For the dependencies, use Python < 3.11:
https://github.com/BoltzmannEntropy/xtts2-ui https://github.com/BoltzmannEntropy/xtts2-ui?tab=MIT-1-ov-file#readme
I had to use this as environment…
RVC-Project
Another VC (Voice Cloning) project
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI?tab=MIT-1-ov-file#readme
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/docs/en/README.en.md
LocalAI Packaged
Apache v2.0 | Run all your local AI together in one package - Ollama, Supabase, n8n, Open WebUI, and more!
Conclusions
There are many ways to generate AI audio from text.
- Google also offers one from their AI Studio: https://aistudio.google.com/prompts/new_chat
Native speech generation - https://aistudio.google.com/generate-speech
For which you will need Google API keys
- And OpenAI also has its own: https://platform.openai.com/playground/tts
Plus, recently I've seen n8n used to build AI workflows.
FAQ
https://github.com/p0p4k/vits2_pytorch https://github.com/p0p4k/vits2_pytorch?tab=MIT-1-ov-file#readme
https://github.com/yl4579/StyleTTS?tab=MIT-1-ov-file#readme
There are also samples for Piper, a fast and local text-to-speech system, generated from the first paragraph of the Wikipedia entry for rainbow.
Adding TTS to MultiChat
xTTS2
Text to Speech with xTTS2 UI, which uses the package: https://pypi.org/project/TTS/
Meaning CoquiTTS under the hood
MIT | A User Interface for XTTS-2 Text-Based Voice Cloning using only 10 seconds of speech
The model's license:
- https://coqui.ai/cpml.txt
- Hardware needed: works with CPU ✅
Installing xTTS2 to clone audio locally:
git clone https://github.com/pbanuru/xtts2-ui.git
cd xtts2-ui
python3 -m venv venv
source venv/bin/activate
Install the right PyTorch build for your hardware: https://pytorch.org/get-started/locally/
#pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
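Since xtts2-ui is just a front end for the TTS package, the same voice cloning is reachable from plain Python. A hedged sketch: the model name is the XTTS v2 identifier from the TTS model list, and my_voice.wav stands in for your ~10-second reference clip:

```python
def xtts_args(text: str, speaker_wav: str,
              language: str = "en", file_path: str = "out.wav") -> dict:
    """Collect the keyword arguments that TTS.tts_to_file expects."""
    if not text.strip():
        raise ValueError("text must be non-empty")
    return dict(text=text, speaker_wav=speaker_wav,
                language=language, file_path=file_path)

# With the TTS package installed (uncomment to try; the model
# downloads on first use and runs on CPU, slowly):
# from TTS.api import TTS
# tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
# tts.tts_to_file(**xtts_args("Hello!", speaker_wav="my_voice.wav"))
```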
#version: '3'
services:
  audio:
    image: python:3.10-slim
    container_name: audio
    command: tail -f /dev/null # keep the container running
    volumes:
      - ai_audio:/app
    working_dir: /app # Set the working directory to /app
    ports:
      - "7865:7865"

volumes:
  ai_audio:
podman exec -it audio /bin/bash
python --version
apt update
apt install git
#git --version
git clone https://github.com/BoltzmannEntropy/xtts2-ui
cd xtts2-ui
#python -m venv venvaudio
#pip3 install torch torchvision torchaudio && pip install -r requirements.txt && pip install --upgrade TTS && streamlit run app2.py
pip3 install torch torchvision torchaudio #https://pytorch.org/get-started/locally/
pip install -r requirements.txt
pip install --upgrade TTS
streamlit run app2.py
Streamlit UI
streamlit run app2.py

text_generation_webui_xtts
Deploying these with Portainer is always easier:
sudo docker run -d -p 8000:8000 -p 9000:9000 --name=portainer --restart=always -v /var/run/docker.sock:/var/run/docker.sock -v portainer_data:/data portainer/portainer-ce
# docker stop portainer
# docker rm portainer
# docker volume rm portainer_data

Clone Audio
Taking some help from yt-dlp: https://github.com/yt-dlp/yt-dlp
Unlicense | A feature-rich command-line audio/video downloader
yt-dlp -x --audio-format wav "https://www.youtube.com/watch?"
yt-dlp -x --audio-format wav "https://www.youtube.com/watch?v=5Em5McC_ulc"
Which I could not get working; nor could I get https://github.com/ytdl-org/youtube-dl working:
sudo apt install youtube-dl
youtube-dl -x --audio-format mp3 "https://www.youtube.com/watch?v=5Em5McC_ulc"
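If the one-liners fail, driving yt-dlp from Python with subprocess can surface clearer errors. Note that -x with --audio-format needs ffmpeg on the PATH to do the conversion, which is a common reason the command appears not to work. A small sketch, reusing the video URL from above:

```python
import subprocess

def yt_dlp_cmd(url: str, audio_format: str = "wav") -> list:
    """Build the yt-dlp argv: -x extracts audio, --audio-format converts it."""
    return ["yt-dlp", "-x", "--audio-format", audio_format, url]

# Requires yt-dlp and ffmpeg on PATH (uncomment to try):
# subprocess.run(yt_dlp_cmd("https://www.youtube.com/watch?v=5Em5McC_ulc"),
#                check=True)
```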
FAQ
https://github.com/kanttouchthis/text_generation_webui_xtts/?tab=readme-ov-file
With the Oobabooga Gradio UI
And its extensions: https://github.com/oobabooga/text-generation-webui-extensions
Voice?
Generally, here you can get many ideas: https://github.com/sindresorhus/awesome-whisper
Also, on Hugging Face there are already interesting projects.
ecoute (OpenAI API needed)
Meeper (OpenAI API needed)
Bark
Whisper - https://github.com/openai/whisper
- Free and Open Source Machine Translation API. Self-hosted, offline capable and easy to set up.
Linux Desktop App:
flatpak install flathub net.mkiol.SpeechNote
flatpak run net.mkiol.SpeechNote
T2S/TTS - text to speech
- elevenlabs - https://elevenlabs.io/pricing
- https://azure.microsoft.com/en-us/products/ai-services/text-to-speech
And now there is even prompt-to-audio with Google Veo 3.