Audio with AI. TTS and Voice Cloning

May 10, 2025

TTS

Let's see some Text-to-Speech AI tools!

Star History Chart

Including Google and OpenAI solutions

LocalAI TTS

MIT | 🤖 The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed, P2P inference
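A minimal sketch of bringing it up with Docker (the localai/localai:latest-aio-cpu image tag and the 8081:8080 port mapping are my assumptions here; check the LocalAI docs for GPU images and other tags):

docker run -d \
    --name local-ai \
    -p 8081:8080 \
    localai/localai:latest-aio-cpu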

The UI will be at: http://192.168.0.12:8081/

And it has a swagger API: http://192.168.0.12:8081/swagger/

Local AI UI
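And since it is meant as a drop-in replacement for the OpenAI API, a TTS call can be sketched like this (assuming the OpenAI-compatible /v1/audio/speech route; the model and voice names depend on what you have installed locally):

curl http://192.168.0.12:8081/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model": "tts-1", "input": "Hello from LocalAI", "voice": "alloy"}' \
    --output hello.wav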

Coqui TTS

Local voice chatbot for engaging conversations, powered by Ollama, Hugging Face Transformers, and Coqui TTS Toolkit

MPL | 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Eager to spin up a local Coqui Text-to-Speech server?

docker run -d \
    --name coquitts \
    -p 5002:5002 \
    --entrypoint python3 \
    ghcr.io/coqui-ai/tts-cpu \
    TTS/server/server.py \
    --model_name \
    tts_models/en/vctk/vits

It will start with the en/vctk/vits model, but you can change it later on.

The web UI will be at port 5002:

(Coqui TTS web UI)

And it works with more languages than English as well!

Deploy with the related docker-compose for CoquiTTS.

Deploy CoquiTTS with Docker | CLI Details 📌
docker exec -it coquitts /bin/bash

docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu

python3 TTS/server/server.py --list_models #To get the list of available models

python3 TTS/server/server.py --model_name tts_models/en/vctk/vits # To start a server
#python3 TTS/server/server.py --model_name tts_models/es/mai/tacotron2-DDC
services:
  tts-cpu:
    image: ghcr.io/coqui-ai/tts-cpu
    container_name: coquitts
    ports:
      - "5002:5002"
    entrypoint: /bin/bash
    tty: true
    stdin_open: true
    # Optional: Mount a volume to persist data or access local files
    # volumes:
    #   - ./local_data:/data
ℹ️ That server.py is a Flask app, by the way :)
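And because it is just a Flask app, you can call its HTTP API directly instead of using the web UI. A minimal sketch, assuming the demo server's /api/tts endpoint and the multi-speaker vctk/vits model (speaker IDs such as p225 come from the VCTK dataset):

curl -G "http://localhost:5002/api/tts" \
    --data-urlencode "text=Hello from Coqui TTS" \
    --data-urlencode "speaker_id=p225" \
    -o hello.wav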

More

Real-Time Voice Cloning

https://github.com/CorentinJ/Real-Time-Voice-Cloning

BARK

Important: get the right PyTorch version from https://pytorch.org/get-started/locally/

MIT | The code for the bark-voicecloning model. Training and inference.

OpenVoice

F/OSS Voice Cloning

https://github.com/CorentinJ/Real-Time-Voice-Cloning
https://pythonawesome.com/clone-a-voice-in-5-seconds-to-generate-arbitrary-speech-in-real-time/

XTTS2 Local Voice Cloning

A UI for Coqui TTS

See the guide: /guide-xtts2-ui

https://www.youtube.com/watch?v=0vGeWA8CSyk

For the dependencies, use Python < 3.11:

https://github.com/BoltzmannEntropy/xtts2-ui
https://github.com/BoltzmannEntropy/xtts2-ui?tab=MIT-1-ov-file#readme

I had to use this as the environment…

RVC-Project

Another VC (Voice Cloning) project

https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI?tab=MIT-1-ov-file#readme

https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/docs/en/README.en.md

LocalAI Packaged

Apache v2.0 | Run all your local AI together in one package - Ollama, Supabase, n8n, Open WebUI, and more!


Conclusions

There are many ways to generate AI audio from text.

  1. Google also offers one from their AI Studio: https://aistudio.google.com/prompts/new_chat

     The Native Speech Generation - https://aistudio.google.com/generate-speech

     For which you will need Google API Keys.

  2. And OpenAI also has its own: https://platform.openai.com/playground/tts (see the API sketch below)
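For reference, the OpenAI endpoint behind that playground is /v1/audio/speech. A quick sketch (you need an API key in OPENAI_API_KEY; model and voice names may change over time):

curl https://api.openai.com/v1/audio/speech \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "tts-1", "input": "Hello from OpenAI TTS", "voice": "alloy"}' \
    --output speech.mp3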

Plus, recently I've seen n8n being used to build AI workflows as well.

FAQ

https://github.com/yl4579/StyleTTS?tab=MIT-1-ov-file#readme

Piper is a fast, local text-to-speech system (https://github.com/rhasspy/piper); its project page includes audio samples generated from the first paragraph of the Wikipedia entry for rainbow.
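A minimal usage sketch, assuming you install it with pip install piper-tts and have downloaded a voice model (the en_US-lessac-medium.onnx name below is just an example):

echo 'A rainbow is a meteorological phenomenon caused by reflection, refraction and dispersion of light.' | \
    piper --model en_US-lessac-medium.onnx --output_file rainbow.wav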

Adding TTS to MultiChat

xTTS2

Text to Speech with xTTS2 UI, which uses the package: https://pypi.org/project/TTS/

Meaning CoquiTTS under the hood

MIT | A User Interface for XTTS-2 Text-Based Voice Cloning using only 10 seconds of speech
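Since the underlying TTS package also ships a tts command-line tool, the same XTTS-v2 voice cloning can be sketched without the UI (flags as I recall them from the Coqui docs; target_voice.wav stands in for your ~10-second reference clip):

tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --text "This should sound like the reference speaker." \
    --speaker_wav target_voice.wav \
    --language_idx en \
    --out_path cloned.wav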

The model used

Installing xTTS2 to clone audio locally (first in a Python venv, then inside a container):

git clone https://github.com/pbanuru/xtts2-ui.git
cd xtts2-ui

python3 -m venv venv
source venv/bin/activate

Get the right PyTorch installed: https://pytorch.org/get-started/locally/

#pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
#version: '3'

services:
  audio:
    image: python:3.10-slim
    container_name: audio
    command: tail -f /dev/null
    volumes:
      - ai_audio:/app
    working_dir: /app  # Set the working directory to /app
    ports:
      - "7865:7865"

volumes:
  ai_audio:
podman exec -it audio /bin/bash
python --version
apt update
apt install git
#git --version
git clone https://github.com/BoltzmannEntropy/xtts2-ui
cd xtts2-ui

#python -m venv venvaudio
#pip3 install torch torchvision torchaudio && pip install -r requirements.txt && pip install --upgrade TTS && streamlit run app2.py
pip3 install torch torchvision torchaudio #https://pytorch.org/get-started/locally/
pip install -r requirements.txt
pip install --upgrade TTS

streamlit run app2.py 

Streamlit UI


text_generation_webui_xtts


More

Managing these containers with Portainer is always easier:

sudo docker run -d -p 8000:8000 -p 9000:9000 --name=portainer --restart=always -v /var/run/docker.sock:/var/run/docker.sock -v portainer_data:/data portainer/portainer-ce

# docker stop portainer
# docker rm portainer
# docker volume rm portainer_data

Clone Audio

Taking some help from yt-dlp: https://github.com/yt-dlp/yt-dlp

Unlicense | A feature-rich command-line audio/video downloader

yt-dlp -x --audio-format wav "https://www.youtube.com/watch?"
yt-dlp -x --audio-format wav "https://www.youtube.com/watch?v=5Em5McC_ulc"

Which I could not get working, nor could I get https://github.com/ytdl-org/youtube-dl working:

sudo apt install youtube-dl
youtube-dl -x --audio-format mp3 "https://www.youtube.com/watch?v=5Em5McC_ulc"

FAQ

https://github.com/kanttouchthis/text_generation_webui_xtts/?tab=readme-ov-file

With the Oobabooga Gradio UI

And its extensions: https://github.com/oobabooga/text-generation-webui-extensions

Voice?

Generally, here you can get many ideas: https://github.com/sindresorhus/awesome-whisper

Also, on Hugging Face (HF) there are already interesting projects.

flatpak install flathub net.mkiol.SpeechNote
flatpak run net.mkiol.SpeechNote

T2S/TTS - text to speech

And now there is even prompt-to-audio with Google Veo 3.