Streamlit MigrAItion - Making cool Webs from old ones


November 14, 2024

It’s always good to have a look at the original pages to know what to expect from them.

Is everything working?

How’s the initial performance of the site?

Checking that a Website works - or in this case, what we will have to Migrate 📌
  • WebCheck -
  • Get all URLs/pages of a website - a must to migrate everything!
  • Are there any broken links?
# Run LinkChecker via Podman (alternative)
#podman run --rm -it ghcr.io/linkchecker/linkchecker:latest --verbose https://fossengineer.com > linkchecker_output.txt

# Run LinkChecker via Docker against the site to migrate
docker run --rm -it -u $(id -u):$(id -g) ghcr.io/linkchecker/linkchecker:latest --verbose https://www.jmodels.net

# Or install LinkChecker locally with pip
pip3 install linkchecker
Check the SiteMap/Robots.txt of a Website 📌
# A 200 response code means the sitemap exists
#curl -s https://example.com/sitemap.xml -o /dev/null -w "%{http_code}\n"
curl -s https://jmodels.net/sitemap.xml -o /dev/null -w "%{http_code}\n" # Hugo PaperMod has it

#optional - check robots.txt
curl -s https://jmodels.net/robots.txt | grep -i sitemap # look for the Sitemap directive
curl -s https://jmodels.net/robots.txt | head -n 10 # see the first 10 lines

I was expecting to just use the latest AI scraping tools that I learned about here

…but Python+BS4 can do very interesting tricks.

ℹ️
It was very helpful to have the Scraping Tools Repo and this related post 💻

Let’s have a look.

For the Real Estate Project

Getting All the Images from the Web to Migrate [BS4] 📌

You can do the following:

  1. Explore the website content
  2. Find the HTML section where all the photo links are stored
  3. Use LLMs to generate Python code that finds the photo links and downloads them into a folder
  4. Repeat for every link you have (leverage the previous functions)
  5. [Optional] Add the folder to the .gitignore

This is what I did with this sample code here, just using BS4.
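
In case it helps, here is a minimal sketch of that kind of script (the URL and the images/ output folder are placeholders, not the actual project values):

import os
from urllib.parse import urljoin
from urllib.request import urlretrieve

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/gallery"  # placeholder: the page holding the photo links
OUT_DIR = "images"                   # placeholder: download folder (add it to .gitignore)

os.makedirs(OUT_DIR, exist_ok=True)

# Fetch the page and parse it with BS4
soup = BeautifulSoup(requests.get(URL, timeout=30).text, "html.parser")

# Collect every <img> source and download it into the folder
for img in soup.find_all("img"):
    src = img.get("src")
    if not src:
        continue
    full_url = urljoin(URL, src)
    filename = os.path.join(OUT_DIR, os.path.basename(full_url.split("?")[0]))
    urlretrieve(full_url, filename)
    print(f"Downloaded {full_url} -> {filename}")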

ℹ️
See how the Real Estate Web Project was created from Scratch with Astro.

A WordPress Migration


Inspect SiteMap with Python

I created this Python script to inspect (and save) the sitemaps.
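
As a rough sketch of what such a script boils down to (assuming the sitemap lives at /sitemap.xml and uses the standard sitemap namespace):

import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://jmodels.net/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}  # standard sitemap namespace

# Download and parse the sitemap XML
root = ET.fromstring(requests.get(SITEMAP_URL, timeout=30).content)

# <loc> entries are either pages (urlset) or nested sitemaps (sitemapindex)
locs = [loc.text for loc in root.findall(".//sm:loc", NS)]
print(f"{len(locs)} entries found in {SITEMAP_URL}")
for url in locs[:10]:
    print(url)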

Use an AI-Generated Script to get all URLs

I used ChatGPT to create the following…

Another Python script that saves all the different URL links from the sitemap as a .csv.
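
Something along these lines (the sitemap_urls.csv file name and the single url column are my own placeholders):

import csv
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://jmodels.net/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Re-use the same parsing as above, then dump every URL to a CSV
root = ET.fromstring(requests.get(SITEMAP_URL, timeout=30).content)
urls = [loc.text for loc in root.findall(".//sm:loc", NS)]

with open("sitemap_urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    writer.writerows([u] for u in urls)

print(f"Saved {len(urls)} URLs to sitemap_urls.csv")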

For the future, you can do these 2 steps with ONE SCRIPT

It’s always good to check whether there are broken links referenced in each original post, so they can be corrected.

Scrape & Get all the info (as-is) of one Link

For scraping, there are a few options, as we saw: Crawl4AI, ScrapeGraph, FireCrawlAI…

But we can have a custom tool with BS4.

Leverage Time - Get info…of ALL Links

In this case, we had to go through ~1k different URLs.
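
A rough sketch of how that loop can look, reusing the CSV from before (the scraped/ output folder, the 1-second pause and the selector are my own assumptions):

import csv
import os
import time

import requests
from bs4 import BeautifulSoup

def scrape_content(url: str, selector: str) -> str:
    """Return the text of the main content block of one page (empty string if missing)."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    node = soup.select_one(selector)
    return node.get_text("\n", strip=True) if node else ""

os.makedirs("scraped", exist_ok=True)

# Read the URLs exported from the sitemap and scrape each one
with open("sitemap_urls.csv", newline="") as f:
    urls = [row["url"] for row in csv.DictReader(f)]

for i, url in enumerate(urls, start=1):
    text = scrape_content(url, "div.post-content")  # example selector, see the Custom BS4 Tool section
    with open(os.path.join("scraped", f"{i:04d}.txt"), "w") as out:
        out.write(text)
    time.sleep(1)  # be gentle with the server when going through ~1k pages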

That was a good reason to go with a HUGO Theme as well. Fuwari Astro was also an alternative.

Big thanks to CaiJimmy/hugo-theme-stack-starter.

Each post has 2 different languages.

Custom BS4 Tool

Making sure that all the info is there by knowing the website’s categories

Category 1 Posts

Example - De hombres y maquinas https://jmodels.net/2020/08/10/de-hombres-maquinas-554-ilyushin-db-3/

  1. See how the HTML is structured
  • In this case, they are under <div class="content clear fleft" id="content">
  2. Create a BS4 script to get the content (a minimal sketch follows this list)
  • And it’s also not there…TBC!!!
  3. Inspect a URL with content - https://jmodels.net/de-hombres-y-maquinas/aire-air/ilyushin-db-3/
  • Inspecting it, we have the content at <div class="post-content clear">
  4. Plug the scraped content into an LLM to get the Markdown, while preserving the initial structure
  5. If you already have a theme selected, you are very close to having one post migrated
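
A minimal sketch of that BS4 step, using the post-content div found during the inspection above (everything else here is an assumption):

import requests
from bs4 import BeautifulSoup

URL = "https://jmodels.net/de-hombres-y-maquinas/aire-air/ilyushin-db-3/"

soup = BeautifulSoup(requests.get(URL, timeout=30).text, "html.parser")

# On this theme the post body sits inside <div class="post-content clear">
post = soup.find("div", class_="post-content")
if post is None:
    raise SystemExit("Content div not found - check the selector for this page type")

# Raw HTML of the post, ready to be handed to an LLM for the Markdown conversion
print(post.decode_contents()[:1000])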

FAQ