Ask the Web with Streamlit and AI


October 17, 2024

But first, a recap of the old school.

ℹ️
A post about what I learnt about Scraping Tools 💻

Old School Scraping

Some time ago I was in an interview process and it was quite hard for them to read my CV.

But… if I always send PDFs, what's wrong?

It seems that there are some HR parsing systems that try to interpret the data.

Sometimes breaking the initial format completely.

Lesson learnt.

A CV must be cool for the human eye, and understandable by machines.

But first, I want to know how many offers are out there.

For sure there is some seasonality. Let's just have a daily look and see how the market is doing.

Or even better, let's make a script to do that.

ℹ️
And applied it for a better CV and job search

Is it a good moment to look for a Job?

Just have a look at how many offers are available now (and remote) vs the historical ones.

Using bs4 and requests to get a feel for the Job Market - Total offers vs Remote offers 📌

Within the CV Check project, at the folder ./Scrap_Pracuj

We are just pushing the data to a SQLite DB.

The data is extracted with the well-known Beautiful Soup approach, where you need to encode the web structure into the scraper.
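For illustration, a minimal sketch of that approach (the URL and the CSS selector are hypothetical; the real ones live in pracuj_v3.py):

# Minimal sketch: requests + bs4 -> SQLite (URL and selector are illustrative)
import sqlite3
from datetime import datetime

import requests
from bs4 import BeautifulSoup

URL = "https://www.pracuj.pl/praca"  # hypothetical listing page

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")
offers = soup.select("div.offer")  # hypothetical selector - breaks if the page changes

conn = sqlite3.connect("job_offers_v3.db")
conn.execute("CREATE TABLE IF NOT EXISTS job_offers (timestamp TEXT, total_offers INTEGER)")
conn.execute("INSERT INTO job_offers VALUES (?, ?)", (datetime.now().isoformat(), len(offers)))
conn.commit()
conn.close()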

⚠️
If the web structure changes, the code needs to be re-worked - as happened here
How to explore the SQLite DB 📌

After executing the script…

./run_pracuj.sh
#/home/reisipi/dirty_repositories/cv-check/Scrap_Pracuj/run_pracuj.sh

#just with python would do the same
#python3 pracuj_v3.py

we will have records:

sudo apt install sqlite3
sqlite3 --version

sqlite3 ./job_offers_v3.db
#sqlite3 /home/reisipi/dirty_repositories/cv-check/Scrap_Pracuj/job_offers_v3.db

#SELECT * FROM your_table_name ORDER BY your_primary_key_column DESC LIMIT 5;

#SELECT name FROM sqlite_master WHERE type='table';
#.tables

SELECT * FROM job_offers;
SELECT * FROM job_offers ORDER BY timestamp DESC LIMIT 5;

#.quit

You can make it run every night by setting up a CRON task with a script.

And after a few days… this is how it looks:

Job Offers Cron Result

Is it a good moment? Up to you.

Set up a CRON job to execute Python -> bs4 -> SQLite DB 📌
nano run_pracuj.sh
chmod +x /home/reisipi/dirty_repositories/cv-check/Scrap_Pracuj/run_pracuj.sh
./run_pracuj.sh

crontab -e
#0 0 * * * /path/to/your/run_pracuj.sh >> /path/to/your/logfile.log 2>&1
0 23 * * * /home/reisipi/dirty_repositories/cv-check/Scrap_Pracuj/run_pracuj.sh
crontab -l
#python3 pracuj_v3.py >> /home/reisipi/dirty_repositories/cv-check/Scrap_Pracuj/script_output.log 2>&1

Check that you still have disk space:

df -h | awk '$2 ~ /G/ && $2+0 > 3' #if you set logs, careful with the disk space (see drives >3GB)

Scraping with AI

So, what can we do to write the code once, and scrape forever?

There are a few options!

With these, you can forget about inspecting web pages and looking for HTML tricks to make a systematic scraper.

ScrapeGraph

ScrapeGraph is a free Python scraper based on AI.

I was testing ScrapeGraph with Streamlit here

With ScrapeGraph, you just need an API key for an LLM and you can ask questions about the content of a website!

ℹ️
ScrapeGraph allows for open models - via Ollama - and also closed LLMs.
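For a quick feel of the API, a minimal sketch with the scrapegraphai package (config keys and model naming may differ between versions):

# Minimal ScrapeGraph sketch - ask a question about a page instead of parsing HTML
import os
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "openai/gpt-4o-mini",  # assumption: recent-version model naming
    },
}

scraper = SmartScraperGraph(
    prompt="List the job offers on this page with title and location",
    source="https://www.pracuj.pl/praca",  # hypothetical target page
    config=graph_config,
)

print(scraper.run())  # a dict-like answer you can dump to .json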

But if you are looking for something quick…

…with OpenAI it is really fast. Plus you already have the API plugged in for any other purpose.

Like summarizing the resulting .json file of the scrape, or any other workflow.

This is what I made with this script - combining ScrapeGraph with an OpenAI API call to summarize the results.

FireCrawl

FireCrawl makes it really easy to parse Web Info.

“Ive got the Key for Success”

ℹ️
I mean, FireCrawl needs an API key to work (there is a free tier)
ℹ️
I used it for the DocPlanner Migration - With this repo and for WPMigration

Cool Things to do With the FireCrawl API

Get page info, scrape it, and more cool things with the FireCrawl API 📌

It can be a companion for web-check.xyz, helping you know which links and pictures are present on a given page.

Very useful for web migrations.

Firecrawl can serve as a tool to see what's referenced on a page - as per the extracted linksOnPage

It can give you the content of a link directly in Markdown - see the script, which also summarizes it with OpenAI.
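As a sketch, with the firecrawl-py SDK it could look like this (parameter names and the response shape differ between SDK versions, so treat the details as assumptions; here a v1-style params dict and a dict-style response are assumed):

# Firecrawl sketch: page -> Markdown + links
import os
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

result = app.scrape_url(
    "https://example.com",  # hypothetical URL
    params={"formats": ["markdown", "links"]},  # assumption: v1-style params
)

print(result.get("markdown", "")[:500])  # the page content as Markdown
print(result.get("links", []))           # what the page references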

flowchart TD
    A[Start] --> B[Load environment variables]
    B --> C[Initialize OpenAI and Firecrawl API clients]
    C --> D{API keys loaded?}
    D -->|No| E[Raise EnvironmentError]
    D -->|Yes| F[Iterate through URLs]
    F --> G[Scrape URL and save data in multiple formats]
    G --> H{Scrape successful?}
    H -->|No| I[Skip to next URL]
    H -->|Yes| J[Save JSON, Markdown, Links, og:title]
    J --> K[Extract content from H1 matching og:title]
    K --> L{Content found?}
    L -->|No| M[Skip to next URL]
    L -->|Yes| N[Save filtered content to file]
    N --> O[Summarize filtered content using OpenAI]
    O --> P[Save summarized content to file]
    P --> Q[Move to next URL]
    Q --> F
    F --> R[End]
ℹ️
Now, given an article or GitHub repository, you can get a summary very fast and decide if it's worth exploring further. Here you have such a script.

Crawl4AI

See the Crawl4AI code

It offers a user-friendly interface and a range of features, including:

  • Ease of use: Crawl4AI is designed to be easy to use, even for those new to web scraping.

  • Fast performance: It is built for speed, outperforming many paid services.

  • LLM-friendly output: It produces output formats that are easy for LLMs to process, such as JSON and cleaned HTML.

  • Asynchronous support: It can crawl multiple URLs simultaneously, making it efficient for large-scale projects.

  • Media extraction: It can extract and return all media tags, including images, audio, and video.

Crawl4AI is available as a Python package and as a Docker image. It is a powerful tool for anyone who needs to extract data from the web for AI applications - see the sketch below.
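A minimal async sketch (the crawl4ai API may shift slightly between releases, so the markdown attribute type is an assumption):

# Crawl4AI sketch: async crawl -> LLM-friendly Markdown
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")  # hypothetical URL
        print(str(result.markdown)[:500])  # cleaned, LLM-friendly Markdown

asyncio.run(main())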

Star History Chart

Other Ways

FireCrawl is not giving me the juice of the offers, as seen during the Scrap-Tools tests.

But… it can be done with 1) OpenAI + pure parsed HTML

Using the OpenAI API seems to be a reliable way when the web structure is not changing too much.

You could do similarly with other LLMs via their APIs.
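A sketch of the idea - strip the HTML down to text and let the model do the extraction (the URL, prompt and model choice are just an example):

# Pure parsed HTML -> OpenAI extraction sketch
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

html = requests.get("https://example.com/job-offer", timeout=30).text  # hypothetical URL
text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the job title, salary and requirements."},
        {"role": "user", "content": text[:12000]},  # stay within context limits
    ],
)
print(response.choices[0].message.content)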

Star History Chart

And another library I saw was 2) embedchain (now included in the mem0 repo).

With mem0/embedchain, we are going a little bit further than just scraping.

The Memory layer for your AI apps

You can of course use embedchain to ask questions about a web page!

When you run the script, you will see that it is embedding content into a ChromaDB under the hood.
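The basic flow is tiny - something like this (assuming the default embedchain config and OPENAI_API_KEY set in the environment; the URL is hypothetical):

# embedchain sketch: add a page, then ask about it
from embedchain import App

app = App()
app.add("https://example.com/blog-post")  # scraped, chunked and embedded into ChromaDB
print(app.query("What is this page about?"))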

More about the mem0 project 📌

Mem0 is a memory layer that improves AI applications by enabling personalized user interactions through intelligent memory management.

This project addresses the need for AI systems that can remember user preferences and evolve over time, enhancing fields like customer support and personalized learning.

  • Key Features:

    1. Multi-level memory retention for users, sessions, and AI agents.
    2. Adaptive personalization that improves with user interactions.
    3. Developer-friendly API for seamless application integration.
    4. Cross-platform consistency for uniform behavior across devices.
    5. Managed service option for hassle-free hosting.
  • Pros:

    • Elevates user experience with tailored interactions.
    • Versatile support for various AI applications and use cases.
    • Simplifies setup and integration with existing systems.
  • Cons:

    • Requires a large language model (LLM), which may not suit all users.
    • Self-hosting may demand additional technical expertise.
  • Alternatives:

    • OpenAI’s memory management solutions.
    • Other AI memory frameworks like Rasa or Dialogflow.

Mem0 offers a promising solution for personalized AI interactions.

You will need the mem0 API, or to plug in one of your favourite LLMs as per the docs, to do other cool things with mem0.

Star History Chart


WebScrap with Streamlit

Time to create. Something.

WebScrap Features

  1. Get summarized web content
  2. Get YouTube Summaries - Enhanced PhiData project & my fork
  3. Get web searches summarized - With DuckDuckGo, as per PhiData

Deploying WebScrap

Cloudflare Tunnels + Cloudflare Access Control


Conclusions

ℹ️
Now you can try the app at:

What can we do now?

  1. Cool CV Stuff
  2. Understanding Repositories Much Better (and faster)
  3. GitHub Quick Summaries!
Tools to get Repo Information 📌

Tweaking a CV as per Offer Info

It's not lying.

It's having a base CV and some instructions for the AI to tweak a few details so that it resonates more with an offer.

There are some AI resume builders out there - like rezi.ai

Reading CV Info

I tried with: Resume-Parser, spaCy (yes, the NER one!), pyresparser and pdfminer.

ResumeParser & PDFMiner gave me the best and simplest results.
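With pdfminer.six, getting the raw text out of a CV is a couple of lines (the file name is hypothetical):

# pdfminer sketch: PDF CV -> plain text
from pdfminer.high_level import extract_text

cv_text = extract_text("my_cv.pdf")
print(cv_text[:500])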

Exploring Job Offers with AI

I tried with FireCrawl, but the juice of the offer is not captured

I can imagine it is due to some robots.txt rule that is blocking it.

But how about feeding pure HTML to OpenAI?

And using other scrapers like Crawl4AI or ScrapeGraph?

Let's find out.

Creating a CV with Code

There are a few alternatives to create a curriculum with code.

And I was already testing them here

ℹ️
You can tweak it for other sites, like nl.indeed.com

Summarize GitHub READMEs

Summarize a GitHub README (or actually any web page).

And… create post skeletons based on that info
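A sketch of that flow - fetch the raw README and ask for a post skeleton (the repo and branch are just an example):

# README -> summary + post skeleton sketch
import requests
from openai import OpenAI

readme = requests.get(
    "https://raw.githubusercontent.com/mem0ai/mem0/main/README.md", timeout=30
).text

client = OpenAI()  # reads OPENAI_API_KEY from the environment
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize this README and draft a blog post skeleton."},
        {"role": "user", "content": readme[:12000]},
    ],
)
print(reply.choices[0].message.content)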


Resources

Related Projects

  • Project YT2Doc - YouTube, Apple Podcast (and more) to readable Markdown.

  • Project YT2MD - Turn a YouTube video or playlist into Markdown file(s) to add to SSG site

Streamlit Related Stuff

How to use Streamlit with Containers

Docker-Compose for Streamlit 📌
version: '3'

services:
  streamlit-openaichatbot:
    image: youraiimage ##docker build -t youraiimage . OR WITH -> podman build -t youraiimage .
    container_name: youraiimage_bot
    volumes:
      - ai_aichatbot:/app
    working_dir: /app  # Set the working directory to /app
    command: /bin/sh -c "streamlit run streamlit_app.py"    
    #command: tail -f /dev/null #debug
    ports:
      - "8507:8501"    

volumes:
  ai_aichatbot:

How to Customize Streamlit Apps

Remove the default Streamlit Sections 📌
hide_st_style = """
            <style>
            #MainMenu {visibility: hidden;}
            footer {visibility: hidden;}
            header {visibility: hidden;}
            </style>
            """
st.markdown(hide_st_style, unsafe_allow_html=True)
How to add Simple Auth to Streamlit 📌
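A minimal password-gate sketch, assuming a password stored in .streamlit/secrets.toml under the key app_password:

# Simple auth sketch: gate the app behind a password from st.secrets
import streamlit as st

def check_password() -> bool:
    if st.session_state.get("authed"):
        return True
    pwd = st.text_input("Password", type="password")
    if pwd and pwd == st.secrets["app_password"]:  # assumption: key name in secrets.toml
        st.session_state["authed"] = True
        return True
    return False

if not check_password():
    st.stop()  # nothing below runs until the password matches

st.write("Welcome to WebScrap!")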
How to customize Streamlit Meta Description 📌