Streamlit MigrAItion - Making Cool Websites from Old Ones
It’s always good to have a look at the original pages to know what to expect from them.
Is everything working?
How’s the initial performance of the site?
Checking that a Website works - or in this case, what we will have to Migrate 📌
- WebCheck
- Get all URLs/pages of a website - a must to migrate everything!
- Are there any broken links?
# Run LinkChecker as a container with Podman, redirecting the report to a file
#podman run --rm -it ghcr.io/linkchecker/linkchecker:latest --verbose https://fossengineer.com > linkchecker_output.txt
# Or with Docker, passing your UID/GID so any output files are not owned by root
docker run --rm -it -u $(id -u):$(id -g) ghcr.io/linkchecker/linkchecker:latest --verbose https://www.jmodels.net
# Or install LinkChecker locally
pip3 install linkchecker
Check the Sitemap/robots.txt of a Website 📌
# Check that the sitemap exists (expect a 200)
#curl -s https://example.com/sitemap.xml -o /dev/null -w "%{http_code}\n"
curl -s https://jmodels.net/sitemap.xml -o /dev/null -w "%{http_code}\n" # Hugo PaperMod ships one
# Optional - check robots.txt
curl -s https://jmodels.net/robots.txt | grep -i sitemap # look for the sitemap location
curl -s https://jmodels.net/robots.txt | head -n 10 # see the first 10 lines
I was expecting to just use the latest AI scraping tools that I learned about here…
…but Python + BS4 can do very interesting tricks.
Let’s have a look.
For the Real Estate Project
Getting All the Images from the Web to Migrate [BS4] 📌
You can do the following:
- Explore the website content
- Find the HTML section where all the photo links are stored
- Use an LLM to generate Python code that finds the photo links and downloads them into a folder
- Repeat for every link you have (leveraging the previous functions)
- [Optional] Add the download folder to your `.gitignore`
This is what I made with the sample code here, using just BS4.
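In spirit, a minimal sketch of that kind of script could look like this (the page URL and output folder are placeholders, not the real project values):

```python
# Sketch: download every image found on one page with requests + BS4.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://example.com/listing"  # placeholder page to migrate
OUTPUT_DIR = "scraped_images"             # remember to add this folder to .gitignore

os.makedirs(OUTPUT_DIR, exist_ok=True)

html = requests.get(PAGE_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Find every <img> tag and resolve relative src attributes against the page URL
for img in soup.find_all("img"):
    src = img.get("src")
    if not src:
        continue
    img_url = urljoin(PAGE_URL, src)
    filename = os.path.basename(urlparse(img_url).path) or "image"
    response = requests.get(img_url, timeout=30)
    if response.status_code == 200:
        with open(os.path.join(OUTPUT_DIR, filename), "wb") as f:
            f.write(response.content)
        print(f"Saved {filename}")
```

Wrap that loop in a function that takes a URL and you can reuse it for every link, as the list above suggests.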
A WordPress Migration
Inspect the Sitemap with Python
I created this Python script to inspect (and save) the sitemaps.
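The full script is not reproduced here, but a minimal version of the idea could look like this (it assumes the standard sitemap `<loc>` entries and needs `lxml` installed for the XML parser):

```python
# Sketch: fetch a sitemap and print every <loc> entry it contains.
import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://jmodels.net/sitemap.xml"

xml = requests.get(SITEMAP_URL, timeout=30).text
soup = BeautifulSoup(xml, "xml")  # the "xml" parser requires lxml to be installed

# Both page sitemaps and sitemap indexes wrap their URLs in <loc> tags
for loc in soup.find_all("loc"):
    print(loc.text.strip())
```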
Use an AI-Generated Script to Get All URLs
I used ChatGPT to create the following…
Another Python script then saves all the different URL links from the sitemap as a `.csv`.
For the future, you can do these two steps with ONE script - something like the sketch below.
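As a hedged sketch (the recursion over nested sitemap indexes and the CSV filename are my assumptions, not the original script):

```python
# Sketch: walk a sitemap (recursing into sitemap indexes) and save every URL to a CSV.
import csv

import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://jmodels.net/sitemap.xml"
CSV_FILE = "all_urls.csv"  # hypothetical output name


def collect_urls(sitemap_url):
    """Return every page URL below this sitemap, recursing into sitemap indexes."""
    xml = requests.get(sitemap_url, timeout=30).text
    soup = BeautifulSoup(xml, "xml")  # needs lxml installed
    urls = []
    # A sitemap index nests <sitemap><loc> entries pointing to more sitemaps
    for sitemap in soup.find_all("sitemap"):
        urls.extend(collect_urls(sitemap.loc.text.strip()))
    # A regular sitemap lists pages as <url><loc>
    for url in soup.find_all("url"):
        urls.append(url.loc.text.strip())
    return urls


all_urls = collect_urls(SITEMAP_URL)
with open(CSV_FILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    writer.writerows([u] for u in all_urls)
print(f"Saved {len(all_urls)} URLs to {CSV_FILE}")
```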
It’s always good to check whether broken links are referenced in each original post, so they can be corrected.
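A quick way to do that (the post URL is a placeholder) is to pull every link out of a post and flag the ones that do not answer with a healthy status:

```python
# Sketch: list the external links referenced in one post and flag the broken ones.
import requests
from bs4 import BeautifulSoup

POST_URL = "https://jmodels.net/some-post/"  # placeholder: one of the original posts

html = requests.get(POST_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    href = a["href"]
    if not href.startswith("http"):
        continue  # skip anchors, mailto: and relative links in this quick pass
    try:
        # HEAD is cheap; some servers reject it, so treat request errors as broken too
        status = requests.head(href, timeout=10, allow_redirects=True).status_code
    except requests.RequestException:
        status = None
    if status is None or status >= 400:
        print(f"BROKEN ({status}): {href}")
```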
Scrape & Get All the Info (As-Is) of One Link
For scraping, there are a few options, as we saw: Crawl4AI, ScrapeGraph, FireCrawlAI…
But we can build a custom tool with BS4.
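For example, a bare-bones version could be a function that fetches one page and returns the content of a given div (the default selector here is a placeholder; we identify the real one per category below):

```python
# Sketch: a tiny custom scraper - fetch one URL and extract a content div with BS4.
import requests
from bs4 import BeautifulSoup


def scrape_page(url, selector="div#content"):
    """Fetch `url` and return the first element matching `selector`, or None."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(selector)


content = scrape_page("https://jmodels.net/2020/08/10/de-hombres-maquinas-554-ilyushin-db-3/")
if content is not None:
    print(content.get_text(strip=True)[:500])  # preview the first 500 characters
```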
Leverage Time - Get the Info… of ALL Links
In this case, we had to go through ~1k different URLs.
That was a good reason to go with a Hugo theme as well. Fuwari (Astro) was also an alternative.
Big thanks to CaiJimmy/hugo-theme-stack-starter.
Each post exists in 2 different languages.
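With ~1k URLs (and two languages per post), the single-page tool above just needs a loop over the CSV we saved earlier. A sketch, assuming the hypothetical all_urls.csv from before:

```python
# Sketch: run the single-page scraper over every URL saved in the CSV.
import csv
import time

import requests
from bs4 import BeautifulSoup


def scrape_page(url, selector="div#content"):
    """Same helper as the previous sketch: fetch a URL, return the content div."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").select_one(selector)


with open("all_urls.csv", newline="") as f:  # hypothetical file from the earlier sketch
    urls = [row["url"] for row in csv.DictReader(f)]

for url in urls:
    content = scrape_page(url)
    print(url, "OK" if content is not None else "no content div found")
    time.sleep(1)  # be polite to the original server: ~1k requests add up
```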
Custom BS4 Tool
Make sure that all the info is there by knowing the website’s categories.
Category 1 Posts
Example - De hombres y maquinas https://jmodels.net/2020/08/10/de-hombres-maquinas-554-ilyushin-db-3/
- See how the HTML is structured
- In this case, the posts are under `<div class="content clear fleft" id="content">`
- Create a BS4 script to get the content (the custom tool sketched above works here)
- Here, it redirects to this link - https://jmodels.net/de-hombres-y-maquinas/aire-air/ilyushin-db-3/
- It was not flagged in the sitemap XML, but checking robots.txt… there is another sitemap:
https://jmodels.net/news-sitemap.xml
curl -s https://jmodels.net/robots.txt | grep -i sitemap # look for the sitemap location
- And it’s also not there… TBC!
- Inspecting a URL with content - https://jmodels.net/de-hombres-y-maquinas/aire-air/ilyushin-db-3/
- Inspecting it, we have the content at `<div class="post-content clear">`
- Plug the scraped content into an LLM to get the Markdown, while preserving the initial structure
- This script takes the info from the given div and feeds it to the OpenAI API to get the `.md` (a sketch follows this list)
- If you already have a theme selected, you are very close to having one post migrated
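My exact script is not shown here, but a minimal sketch of the idea, using the official openai client (the model name and the prompt are assumptions, and OPENAI_API_KEY must be set in the environment):

```python
# Sketch: extract the post div with BS4 and ask the OpenAI API to turn it into Markdown.
import requests
from bs4 import BeautifulSoup
from openai import OpenAI  # official client; reads OPENAI_API_KEY from the environment

POST_URL = "https://jmodels.net/de-hombres-y-maquinas/aire-air/ilyushin-db-3/"

html = requests.get(POST_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")
post = soup.select_one("div.post-content")  # the div we identified above

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any capable chat model works here
    messages=[
        {"role": "system",
         "content": "Convert this HTML to clean Markdown, preserving the original structure."},
        {"role": "user", "content": str(post)},
    ],
)

markdown = response.choices[0].message.content
with open("migrated_post.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```

From there, add the Hugo front matter on top of the `.md` and drop it into your theme’s content folder.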