Talk to your DataFrames with PandasAI

July 12, 2024

Let’s have a look to PandasAI - A Python Project that allow us to chat with our data.

By incorporating retrieved information, RAG frameworks can generate more informative, accurate, and contextually relevant responses compared to models (LLMS) that rely solely on their internal knowledge.

You will need Python installed in your PC and a basic understanding of how to manage properly python Dependencies to make it run.

The PandasAI Project

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc).

PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.

The PandasAI Source Code at Github
- The PandasAI Docs
- License: Partly MIT ❗

Open Source AI Agents for Data Analysis

Why PandasAI?

It’s all about smartdataframes.

PandasAI SmartDataFrames are essentially regular Pandas DataFrames with an added layer of functionality that allows you to interact with them using natural language.

Smart DataFrame Example

Interesting Project with PandasAI

Let’s give it a try to Pandas AI:

Option 1 - with a Python Venv 👇

python -m venv pdai #create the venv | python3 if you are on linux

pdai\Scripts\activate #activate venv (windows)
source pdai/bin/activate #(linux)

pip install pandasai==2.0.37 #pip install -r requirements.txt #all at once

Option 2 - With Google Colaboratory

I found good working examples at:

https://github.com/TirendazAcademy/PandasAI-Tutorials
- https://www.youtube.com/watch?v=bw_e6xgGSTY

Colab Examples Pandas AI

You will need Python installed and the PandasAI package:

https://pypi.org/project/pandasai/2.2.12/

pip install pandasai==2.2.12

With OpenAI and PandasAI

export OPENAI_API_KEY="sk-..." #linux

PandasAI with Ollama and Streamlit

git clone https://github.com/TirendazAcademy/PandasAI-Tutorials
cd ./PandasAI-Tutorials/PandasAI-with-Llama3
#pip install -r requirements.txt #all at once
pip install pandasai==2.0.36 streamlit==1.33 PyYAML==6.0.1
streamlit run app.py

We will need Ollama running ⏬

We can spin up Ollama with Docker like so:

version: '3'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    command: ollama run tinyllama #or any other model
    #command: tail -f /dev/null    

volumes:
  ollama_data:

If you dont want to execute a command in the docker-compose, do it with:

#docker run -d --name python_container_for_dev -p 8502:8501 pythonaicontainer tail -f /dev/null
docker exec -it ollama /bin/bash
ollama run tinyllama

Now the ollama API es ready, serving the tinyllama model at: localhost:11434

#linux
curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

And the API is ready here:

#Windows
curl.exe -X POST http://192.168.3.200:11434/api/generate -d "{\"model\":\"tinyllama\",\"prompt\":\"Who are you?\",\"stream\":false}"

PandasAI with Groq

https://github.com/TirendazAcademy/PandasAI-Tutorials/blob/main/PandasAI-with-Groq.ipynb
- You can open them in Colab - https://colab.research.google.com/github/TirendazAcademy/PandasAI-Tutorials/blob/main/PandasAI-with-Groq.ipynb

conda create -n genai python=3.10

conda activate genai
python --version

#python -m pip install -r requirements.txt #all at once

conda deactivate
conda env remove --name myenv

PandasAI with OpenAI Key ⏬

import pandas as pd #https://github.com/thepycoach/pandasai-tutorial/blob/main/PandasAI-Tutorial.ipynb
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

df = pd.read_csv("supermarket_sales.csv")
df = df[['Gender', 'Product line', 'Total']]

OPENAI_API_KEY = "your-api-key"
llm = OpenAI(api_token=OPENAI_API_KEY)

pandas_ai = PandasAI(llm)

pandas_ai.run(df, prompt="Which products are in Product Line")
pandas_ai.run(df, prompt="Plot a barplot that shows the total spent by each gender")

Conclusion

We have seen how to use PandasAI to Chat with our Data.

But PandasAI is not the only Python Package that we can use for that - Other Popular Options are:

LangChain ❤️
LlamaIndex ❤️

If you are not sure which LLM will be the best, you can give it a try to:

Combine strong/weak models with: https://github.com/lm-sys/RouteLLM

A framework for serving and evaluating LLM routers - save LLM costs without compromising quality!

https://github.com/BerriAI/litellm

Call all LLM APIs using the OpenAI format. Use Bedrock, Azure, OpenAI, Cohere, Anthropic, Ollama, Sagemaker, HuggingFace, Replicate (100+ LLMs)

If you would like to have a look to a project that implements RAG with UI:

A RAG With UI - QUIVR ⏬

https://github.com/QuivrHQ/quivr - Apache License Version 2.0

Open-source RAG Framework for building GenAI Second Brains 🧠 Build productivity assistant (RAG) ⚡️🤖 Chat with your docs (PDF, CSV, …) & apps using Langchain, GPT 3.5 / 4 turbo, Private, Anthropic, VertexAI, Ollama, LLMs, Groq that you can share with users ! Efficient retrieval augmented generation framework

PandasAI Excel as DB with Groq

https://mer.vin/2024/05/pandas-ai-database-excel-chainlit/

FAQ

What it is a RAG? ⏬

In a RAG framework, when a user poses a question or provides a prompt, the system first retrieves relevant information from a large corpus of text data.

The retrieved information is then used to augment the input prompt, providing additional context and knowledge to the language generation model.

The language generation model, often based on deep learning architectures like transformers, takes the augmented prompt and generates a response or continuation of the text.

I was also explaining about RAGs when having a look to the EmbedChain Project

The Project was under Apache v2 Licensed, but now have been absorbed by mem0

Have you heard about Koalas? ⏬

Koalas is a Python library developed by Databricks to provide a pandas-like API for Apache Spark, allowing you to scale your pandas code to big data.

It’s designed to make the transition from pandas to Spark easier by providing an interface that’s familiar to pandas users.

How to run Ollama together with a Python container

Create and run a Docker Stack ⏬

FROM python:3.10-slim

# Install git
RUN apt-get update && apt-get install -y git nano

# Set up the working directory
WORKDIR /app

docker build --no-cache -t pythonaicontainer . #> build_log.txt 2>&1

version: '3'

services:
  ollama:
    image: ollama/ollama
    container_name: ollamapython
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    #command: ollama run tinyllama

  dev-pythonaicontainer:
    image: pythonaicontainer
    container_name: pythonaicontainer
    volumes:
      - ai_pythonaicontainer:/app
    working_dir: /app
    command: tail -f /dev/null
    ports:
      - "8503:8501"

volumes:
  ollama_data:
  ai_pythonaicontainer:

Access the containers with:

podman exec -it ollamapython /bin/bash
ollama --version
ollama pull tinyllama #https://ollama.com/library/tinyllama
#ollama list

podman exec -it pythonaicontainer /bin/bash
python --version
git clone https://github.com/TirendazAcademy/PandasAI-Tutorials
cd ./PandasAI-Tutorials/PandasAI-with-Llama3
pip install -r requirements.txt #all at once
streamlit run app.py

Other F/OSS Ways to Chat with your Data

Mem0, ex-EmbedChain

The memory layer for Personalized AI

How to use EmbedChain 📌

https://docs.embedchain.ai/get-started/quickstart

# conda create --name embedchain python=3.11
# conda activate embedchain
# !python -m venv embedchain_venv

#Unix
#!source embedchain_venv/bin/activate
#Windows
#.\embedchain_venv\Scripts\activate

#deactivate

#Get-ExecutionPolicy
#Set-ExecutionPolicy RemoteSigned
#Set-ExecutionPolicy Restricted

# !pip install embedchain
!pip show embedchain

LangChain

LangChain with Pandas DataFrame Agent
- https://python.langchain.com/v0.2/docs/integrations/toolkits/pandas/
- https://github.com/langchain-ai/langchain (MIT Licensed ❤️)
- The docs - https://python.langchain.com/v0.2/docs/introduction/

LangChain Examples ⏬

https://pypi.org/project/langchain/
Example Q&A About their (Python) docs - https://github.com/langchain-ai/chat-langchain

pip install langchain

It has very interesting features, like the prompt templates - https://python.langchain.com/v0.2/docs/concepts/#prompt-templates
With LangChain DF Agent (A PandasAI Alternative!)
- https://github.com/AssemblyAI/youtube-tutorials/blob/main/pandas-dataframe-agent/Interesting%20examples%20for%20Pandas%20Dataframe%20Agent.ipynb

🦜🔗 Build context-aware reasoning applications

LLamaIndex

LLamaIndex - https://github.com/run-llama/llama_index (MIT Licensed ❤️)
- https://docs.llamaindex.ai/en/stable/

LLamaIndex Examples ⏬

LlamaIndex is a data framework for your LLM applications

Sketch - https://github.com/approximatelabs/sketch (MIT Licensed)
- It allows to get DA tasks, like: questions / visualizations
- Also Data Cataloguing: like general tagging (PII Identification)

Using Sketch ⏬

You can also directly use a few pre-built hugging face models (right now MPT-7B and StarCoder), which will run entirely locally (once you download the model weights from HF).

Do this by setting environment 3 variables:

os.environ['LAMBDAPROMPT_BACKEND'] = 'StarCoder'
os.environ['SKETCH_USE_REMOTE_LAMBDAPROMPT'] = 'False'
os.environ['HF_ACCESS_TOKEN'] = 'your_hugging_face_token'

AI code-writing assistant that understands data content

DuckDB (vs) Pandas for Data Analysis - https://duckdb.org/
- https://ecoagi.ai/topics/DuckDB/duckdb-pandas

More about DuckDB ⏬

It’s designed to support analytical SQL queries and data analysis tasks. The DuckDB project also provides a Python library, allowing you to interact with DuckDB databases using Python.
DuckDB is multithreaded. It uses multiple threads to execute queries, which can lead to significant performance improvements on multicore systems.
DuckDB is generally faster than Pandas when it comes to executing SQL queries on large datasets.

pip install duckdb

DuckDB is an in-memory analytical database written in C++ (speed and efficiency)

Polars - https://github.com/pola-rs/polars

More about Polars ⏬

DuckDB and Polars are both tools used for data analysis, but they have some key differences. DuckDB is an in-memory analytical database that excels at executing SQL queries on large datasets.

Polars is a DataFrame library implemented in Rust and Python that is designed to be fast and flexible. While DuckDB focuses on providing SQL capabilities, Polars provides a DataFrame API that is similar to Pandas.

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

How to use AI (LLMs) Locally

Ollama
Oobabooga - Text Gen Web UI
GPT4All
PrivateGPT
GPT4All
KoboldCpp
LocalAI - Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required.

How to get Started with the LocalAI project ⏬

https://github.com/mudler/LocalAI/
- https://github.com/mudler/LocalAI/blob/master/docker-compose.yaml

Runs gguf, transformers, diffusers and many more models architectures. It allows to generate Text, Audio, Video, Images. Also with voice cloning capabilities.

https://localai.io/models/

Keep in mind models compatible with LocalAI must be quantized in the gguf format.

It will be available at: localhost:8080

http://192.168.3.235:8080/swagger/index.html
https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest

version: '3.6'

services:
  api:
    # See https://localai.io/basics/getting_started/#container-images for
    # a list of available container images (or build your own with the provided Dockerfile)
    # Available images with CUDA, ROCm, SYCL
    # Image list (quay.io): https://quay.io/repository/go-skynet/local-ai?tab=tags
    # Image list (dockerhub): https://hub.docker.com/r/localai/localai
    image: quay.io/go-skynet/local-ai:master-ffmpeg-core
    build:
      context: .
      dockerfile: Dockerfile
      args:
      - IMAGE_TYPE=core
      - BASE_IMAGE=ubuntu:22.04
    ports:
      - 8080:8080
    env_file:
      - .env
    environment:
      - MODELS_PATH=/models
    #  - DEBUG=true
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
    command:
    # Here we can specify a list of models to run (see quickstart https://localai.io/basics/getting_started/#running-models )
    # or an URL pointing to a YAML configuration file, for example:
    # - https://gist.githubusercontent.com/mudler/ad601a0488b497b69ec549150d9edd18/raw/a8a8869ef1bb7e3830bf5c0bae29a0cce991ff8d/phi-2.yaml
    - phi-2

🤖 The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first, AnythingLLM!

AnythingLLM ⏬

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.

And Multi-Agents

If one LLM can be helpful, how about combining them to work together?

MetaGPT
CrewAI - CrewAI is designed to enable AI agents to assume roles, share goals, and operate in a cohesive unit
AutoGen
ChatDev - https://github.com/OpenBMB/ChatDev

Try them together with LLMOps Tools like Pezzo AI or Agenta

How to get Better at Prompt Engineering

I was exploring prompt engineering already sometime ago here.

But lately discovered about the following resources:

🐙 Guides, papers, lecture, notebooks and resources for prompt engineering

Chat with different models with Streamlit [Multichat][AI] LangChain Explained - Providing context to LLMs