Testing AI Scraping projects: GPT-Crawler...

December 7, 2024

Leverage your Research with AI Scrap Tools - Repo Reader Setup.

There are already a few AI scraping tools out there.

Let's have a look at Repo Reader.

More AI-Powered Scraping

Star History Chart

Crawl4AI

ScrapeGraph

FireCrawl


Conclusions

There will always be room for Beautiful Soup.

But we have to recognize the power of these AI-powered scrapers.

Star History Chart

RAG / Agentic Frameworks

GPT-CRAWLER

git clone https://github.com/builderio/gpt-crawler
cd gpt-crawler

node -v     # check your Node.js version
npm i       # install dependencies

# edit config.ts, then run the crawler
bun start   # or: npm start

config.ts:

import { Config } from "./src/config";


export const defaultConfig: Config = {
  url: "https://iotechcrafts.com",
  match: "https://iotechcrafts.com/blog/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  maxTokens: 2000000,
};

// export const defaultConfig: Config = {
//   url: "https://www.builder.io/c/docs/developers",
//   match: "https://www.builder.io/c/docs/**",
//   maxPagesToCrawl: 50,
//   outputFileName: "output.json",
//   maxTokens: 2000000,
// };
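The match pattern in the config above limits which discovered links the crawler follows. As a rough illustration of how such a glob behaves, here is a simplified matcher. It is a sketch only: the helper names are hypothetical, the sample paths are made up, and the crawler's real matching is done by its underlying library and may differ in edge cases.

```typescript
// Simplified glob matcher, for illustration only (not the crawler's own code).
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters, leaving "*" for the translation below.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .split("**") // "**" may span several path segments
    .map((part) => part.replace(/\*/g, "[^/]*")) // "*" stays within one segment
    .join(".*");
  return new RegExp(`^${pattern}$`);
}

function matchesGlob(glob: string, url: string): boolean {
  return globToRegExp(glob).test(url);
}

// Same pattern as in config.ts; the candidate paths are hypothetical.
const matchPattern = "https://iotechcrafts.com/blog/**";
const candidates = [
  "https://iotechcrafts.com/blog/some-post",
  "https://iotechcrafts.com/blog/2024/some-post",
  "https://iotechcrafts.com/about",
];
for (const candidate of candidates) {
  console.log(candidate, matchesGlob(matchPattern, candidate) ? "-> crawl" : "-> skip");
}
```

With this pattern, only pages under /blog/ would be kept, which is why the url and match values usually share the same prefix.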

Go to https://chat.openai.com/gpts and create a custom GPT:

  1. Upload the output.json
  2. Create the GPT with: Reads through my website data
  3. Ask questions, for example: what do you know about iotechcrafts?

  • I want to build a chatbot for documentation
  • Upload the JSON file generated by the crawler
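Before uploading, it can help to sanity-check what the crawler produced. The sketch below assumes that output.json is an array of objects with title, url and html fields (check this against your own file), and it estimates tokens with a crude 4-characters-per-token heuristic rather than a real tokenizer. The function name and the sample data are hypothetical.

```typescript
// Assumed shape of one entry in output.json (verify against your own file).
interface CrawledPage {
  title: string;
  url: string;
  html: string;
}

interface CrawlSummary {
  pageCount: number;
  totalChars: number;
  approxTokens: number;
  emptyPages: string[];
}

function summarizeCrawl(pages: CrawledPage[]): CrawlSummary {
  const totalChars = pages.reduce((sum, page) => sum + page.html.length, 0);
  return {
    pageCount: pages.length,
    totalChars,
    // Rough heuristic (~4 characters per token), not a real tokenizer.
    approxTokens: Math.ceil(totalChars / 4),
    // Pages with no extracted text add nothing to the knowledge file.
    emptyPages: pages.filter((page) => page.html.trim() === "").map((page) => page.url),
  };
}

// Hypothetical sample standing in for the parsed contents of output.json.
const samplePages: CrawledPage[] = [
  { title: "Post", url: "https://iotechcrafts.com/blog/some-post", html: "Hello world" },
  { title: "Empty", url: "https://iotechcrafts.com/blog/empty", html: "   " },
];
console.log(summarizeCrawl(samplePages));
```

In a real run you would pass the parsed output.json instead of samplePages and compare approxTokens against the maxTokens value in config.ts.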

Crawl a site to generate knowledge files to create your own custom GPT from a URL

A note on the license: the ISC (Internet Systems Consortium) license is a permissive open source license, similar in terms to the MIT and BSD licenses, allowing almost unrestricted freedom to use, modify, and distribute the software. It is approved by the Open Source Initiative (OSI).

Key features of the ISC license include:

  • Simplicity: It's a very brief license, easy to understand, and straightforward in its permissions and limitations.
  • Permissiveness: It allows for commercial use, modification, distribution, and private use of the software.
  • Minimal Requirements: The only significant requirement is to include the copyright notice and the license itself with any copies of the software or substantial portions of it.

Because of its simplicity and permissiveness, the ISC license is favored for projects that wish to impose minimal restrictions on the use and distribution of their software, promoting open and free use of the code.


Conclusions

See also:

  1. LangChain Web Scraping

  2. Browserless

Deploy headless browsers in Docker. Run on our cloud or bring your own. Free for non-commercial uses.


FAQ

Interesting Ways to Add Memory to LLMs

PandasAI

Playing with Vector DBs

Vector Admin

A GUI for vector DBs such as Qdrant, ChromaDB or Pinecone.

The universal tool suite for vector database management. Manage Pinecone, Chroma, Qdrant, Weaviate and more vector databases with ease.

https://github.com/Mintplex-Labs/vector-admin

Docker

https://github.com/Mintplex-Labs/vector-admin/blob/master/docker/DOCKER.md

git clone git@github.com:Mintplex-Labs/vector-admin.git ./vector-admin
cd vector-admin
cd docker
cp .env.example .env   # then adjust the values

# Port 5432 works as-is because Postgres runs in the same Compose stack

JWT_SECRET="some-random-string"
SYS_EMAIL="root@vectoradmin.com"
SYS_PASSWORD="password"
DATABASE_CONNECTION_STRING="postgresql://vectoradmin:password@postgres:5432/vdbms" # Valid PG Connection string.
INNGEST_SIGNING_KEY="some-random-string"

For external container:

sudo docker-compose up -d --build vector-admin

Go to: http://localhost:3001

Your first login requires the SYS_EMAIL and SYS_PASSWORD set via ENV during build or run. After onboarding, this login is permanently disabled.

Try connecting to Qdrant with: http://192.168.3.103:6333 (replace the IP with that of the machine running Qdrant)

And to Chroma:

Chroma running locally: when connecting to a Chroma instance that runs on the same machine, use http://host.docker.internal:[CHROMA_PORT] as the URL.

Using: http://localhost:8001

If you stored something, it will be at: http://localhost:8001/api/v1/collections
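The host.docker.internal rule can be sketched as a small helper that rewrites loopback URLs before they are entered in a tool running inside a container. The function name is hypothetical and the sketch only handles localhost and 127.0.0.1; other hosts are passed through with only URL normalization.

```typescript
// Rewrites a loopback URL so it points at the Docker host instead of the
// container itself. Non-loopback URLs are returned normalized but otherwise unchanged.
function toDockerHostUrl(rawUrl: string): string {
  const parsed = new URL(rawUrl);
  if (parsed.hostname === "localhost" || parsed.hostname === "127.0.0.1") {
    parsed.hostname = "host.docker.internal";
  }
  return parsed.toString();
}

// Chroma published on port 8001 of the same machine:
console.log(toDockerHostUrl("http://localhost:8001/api/v1/collections"));
// A LAN address is left alone:
console.log(toDockerHostUrl("http://192.168.3.103:6333"));
```

The port and path are preserved, so only the hostname part of the URL changes.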


F/OSS Vector DBs

MINDSDB

It can work with Ollama.

  • https://github.com/mindsdb/mindsdb
  • https://github.com/mindsdb/mindsdb/blob/staging/LICENSE
  • https://hub.docker.com/u/mindsdb
  • https://docs.mindsdb.com/what-is-mindsdb
  • https://pypi.org/project/MindsDB/
  • https://github.com/mindsdb/mindsdb/blob/staging/mindsdb/integrations/handlers/ollama_handler/README.md

QDRANT

docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant

Or with Docker Compose:

version: '3'
services:
  qdrant:
    container_name: my_qdrant_container
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage # Qdrant's data directory inside the container

volumes:
  qdrant_data:

Check its UI at:

http://localhost:6333/dashboard#
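Besides the dashboard, the same port serves Qdrant's REST API, so you can also confirm the instance from code. The sketch below lists collection names via GET /collections. The interface is reduced to the fields used here, the function names are hypothetical, and the sample body is made up for illustration.

```typescript
// Response body of GET /collections, reduced to the fields used here.
interface QdrantCollectionsResponse {
  result: { collections: { name: string }[] };
  status: string;
}

// Pure helper: extracts the collection names from a parsed response body.
function collectionNames(body: QdrantCollectionsResponse): string[] {
  return body.result.collections.map((collection) => collection.name);
}

// Fetches the collection list from a running Qdrant instance.
async function listQdrantCollections(baseUrl: string): Promise<string[]> {
  const response = await fetch(`${baseUrl}/collections`);
  if (!response.ok) {
    throw new Error(`Qdrant answered HTTP ${response.status}`);
  }
  return collectionNames((await response.json()) as QdrantCollectionsResponse);
}

// Made-up sample body, so the helper can be tried without a server.
const sampleBody: QdrantCollectionsResponse = {
  result: { collections: [{ name: "docs" }, { name: "blog" }] },
  status: "ok",
};
console.log(collectionNames(sampleBody));

// Usage against the container started above (needs Qdrant running):
// listQdrantCollections("http://localhost:6333").then(console.log);
```

A fresh container should report an empty list, which is still a useful sign that the port mapping works.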

ChromaDB

docker pull chromadb/chroma
docker run -p 8001:8000 chromadb/chroma

Or with Docker Compose:

version: '3.9'

services:
  chroma:
    container_name: chroma-container
    image: chromadb/chroma
    ports:
      - "8001:8000"
    volumes:
      - chroma_data:/chroma/chroma

volumes:
  chroma_data:
    driver: local

Then go to http://localhost:8001 and http://localhost:8001/api/v1 to check the heartbeat. If it responds, you are good to go with ChromaDB.
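The heartbeat check can also be scripted, for example as a readiness probe before loading data. This is a minimal sketch that assumes the v1 REST prefix used in this post (/api/v1/heartbeat); newer Chroma releases may expose the endpoint under a different prefix. The function names are hypothetical.

```typescript
// Builds the heartbeat URL for a Chroma base URL (v1 API prefix assumed).
function chromaHeartbeatUrl(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, "") + "/api/v1/heartbeat";
}

// Resolves to true when the server answers the heartbeat with an HTTP 2xx status.
async function isChromaAlive(baseUrl: string): Promise<boolean> {
  try {
    const response = await fetch(chromaHeartbeatUrl(baseUrl));
    return response.ok;
  } catch {
    return false; // connection refused, DNS failure, and similar network errors
  }
}

// Usage against the container started above (needs Chroma running):
// isChromaAlive("http://localhost:8001").then((alive) => console.log(alive));
```

Returning false on network errors keeps the probe simple to call in a retry loop while the container is still starting.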


FAQ

How to Build the ChromaDB Container?

git clone git@github.com:chroma-core/chroma.git
cd chroma
docker-compose up -d --build #https://raw.githubusercontent.com/chroma-core/chroma/main/docker-compose.yml