Testing AI Scraping projects: GPT-Crawler...
Leverage your Research with AI Scraping Tools - Repo Reader Setup.
There are already a few AI scraping tools out there.
Let's have a look at Repo Reader.
More AI-Powered Scraping
Crawl4AI
ScrapeGraph
FireCrawl
Conclusions
There will always be a place for Beautiful Soup.
But we have to recognize the power of these AI-powered scrapers.
RAG / Agentic Frameworks
GPT-CRAWLER
- https://github.com/BuilderIO/gpt-crawler?tab=ISC-1-ov-file#readme
- https://www.youtube.com/watch?v=0wJ1rgvUQkE
git clone https://github.com/builderio/gpt-crawler
cd gpt-crawler
# node -v   # check your Node.js version first
npm i       # install dependencies
# modify config.ts before starting
bun start   # or: npm start
import { Config } from "./src/config";
export const defaultConfig: Config = {
  url: "https://iotechcrafts.com",
  match: "https://iotechcrafts.com/blog/**",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  maxTokens: 2000000,
};
// export const defaultConfig: Config = {
//   url: "https://www.builder.io/c/docs/developers",
//   match: "https://www.builder.io/c/docs/**",
//   maxPagesToCrawl: 50,
//   outputFileName: "output.json",
//   maxTokens: 2000000,
// };
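To get a feel for what url, match and maxPagesToCrawl do together, here is a rough Python sketch. It is not GPT-Crawler's actual code: it only approximates the glob matching with the standard library's fnmatch, and the candidate URLs (post-1, about, 2024/post-2) are made up for illustration.

```python
from fnmatch import fnmatchcase


def select_urls(urls, match, max_pages):
    """Keep URLs that fit the glob `match`, capped at `max_pages`."""
    selected = []
    for url in urls:
        if len(selected) >= max_pages:
            break
        # fnmatchcase lets "*" match any characters, including "/",
        # so "**" behaves like a recursive wildcard in this approximation.
        if fnmatchcase(url, match):
            selected.append(url)
    return selected


# Hypothetical links found while crawling the start URL.
candidates = [
    "https://iotechcrafts.com/blog/post-1",
    "https://iotechcrafts.com/about",
    "https://iotechcrafts.com/blog/2024/post-2",
]

print(select_urls(candidates, "https://iotechcrafts.com/blog/**", max_pages=50))
```

With the pattern from the config, only the two /blog/ URLs are kept, and lowering max_pages cuts the list short.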
https://chat.openai.com/gpts
Upload the output.json
Create the GPT with a description such as: Reads through my website data
Ask questions, for example: what do you know about iotechcrafts?
- I want to build a chatbot for documentation
- Upload the JSON file generated by the crawler
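Before uploading, it can be worth checking what the crawler actually captured. The sketch below assumes output.json is a JSON array of objects with title, url and html keys; that shape is an assumption on my side, so adjust the keys if your file looks different. The sample page is invented.

```python
import json


def load_pages(path="output.json"):
    """Read the crawler output from disk (assumed to be a JSON array)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_pages(pages):
    """Return (title, url, character count of the text) for each page."""
    return [
        (page.get("title", ""), page.get("url", ""), len(page.get("html", "")))
        for page in pages
    ]


# Hypothetical sample standing in for load_pages("output.json").
sample = [
    {
        "title": "Hello",
        "url": "https://iotechcrafts.com/blog/hello",
        "html": "Hello world",
    }
]

for title, url, size in summarize_pages(sample):
    print(f"{size:>8}  {url}  {title}")
```

Swap sample for load_pages() once the crawl has finished; pages with a very small character count are usually the ones where the selector missed the main content.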
Crawl a site to generate knowledge files to create your own custom GPT from a URL
GPT-Crawler is published under the ISC (Internet Systems Consortium) license, which is an open source license. It is a permissive license, similar in terms to the MIT and BSD licenses, allowing almost unrestricted freedom to use, modify, and distribute the software. The ISC license is approved by the Open Source Initiative (OSI).
Key features of the ISC license include:
- Simplicity: It's a very brief license, easy to understand, and straightforward in its permissions and limitations.
- Permissiveness: It allows for commercial use, modification, distribution, and private use of the software.
- Minimal Requirements: The only significant requirement is to include the copyright notice and the license itself with any copies of the software or substantial portions of it.
Because of its simplicity and permissiveness, the ISC license is favored for projects that wish to impose minimal restrictions on the use and distribution of their software, promoting open and free use of the code.
Conclusions
See also:
LangChain Web Scraping
Browserless
Deploy headless browsers in Docker. Run on our cloud or bring your own. Free for non-commercial uses.
FAQ
Interesting Ways to Add Memory to LLMs
PandasAI
- https://pypi.org/project/pandasai/
- https://docs.pandas-ai.com/en/latest/
- https://github.com/gventuri/pandas-ai
Playing with Vector DBs
Vector Admin
A GUI for vector DBs like Qdrant, ChromaDB or Pinecone
The universal tool suite for vector database management. Manage Pinecone, Chroma, Qdrant, Weaviate and more vector databases with ease.
https://github.com/Mintplex-Labs/vector-admin
Docker
https://github.com/Mintplex-Labs/vector-admin/blob/master/docker/DOCKER.md
git clone git@github.com:Mintplex-Labs/vector-admin.git ./vector-admin
cd vector-admin
cd docker
cp .env.example .env   # then adjust the values
# port 5432 is fine as-is, because Postgres runs in the same stack
JWT_SECRET="some-random-string"
SYS_EMAIL="root@vectoradmin.com"
SYS_PASSWORD="password"
DATABASE_CONNECTION_STRING="postgresql://vectoradmin:password@postgres:5432/vdbms" # Valid PG Connection string.
INNGEST_SIGNING_KEY="some-random-string"
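Rather than leaving "some-random-string" in place, you can generate the two secrets. This is a small sketch using Python's standard secrets module; I am assuming any sufficiently long random string is accepted for both variables, which is what the placeholder values suggest.

```python
import secrets


def make_secret(num_bytes=32):
    """Return a URL-safe random string built from `num_bytes` random bytes."""
    return secrets.token_urlsafe(num_bytes)


# Print lines that can be pasted into the .env file.
for name in ("JWT_SECRET", "INNGEST_SIGNING_KEY"):
    print(f'{name}="{make_secret()}"')
```

Change SYS_PASSWORD as well before exposing the instance beyond your own machine.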
For external container:
sudo docker-compose up -d --build vector-admin
Go to: http://localhost:3001
Your first login requires the SYS_EMAIL and SYS_PASSWORD set via ENV during build or run. After onboarding, this login is permanently disabled.
Try connecting to Qdrant with: http://192.168.3.103:6333 (use the IP of the machine running Qdrant)
And to Chroma:
When connecting to a Chroma instance running on the same machine, use http://host.docker.internal:[CHROMA_PORT] as the URL.
Using: http://localhost:8001
If you stored something, it will be at: http://localhost:8001/api/v1/collections
F/OSS Vector DBs
MINDSDB
It can work with Ollama.
- https://github.com/mindsdb/mindsdb
- https://github.com/mindsdb/mindsdb/blob/staging/LICENSE
- https://hub.docker.com/u/mindsdb
- https://docs.mindsdb.com/what-is-mindsdb
- https://pypi.org/project/MindsDB/
QDRANT
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
version: '3'
services:
  qdrant:
    container_name: my_qdrant_container
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage  # Qdrant's data directory inside the container
volumes:
  qdrant_data:
Check its UI at:
http://localhost:6333/dashboard#
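Besides the dashboard, you can check the instance from a script through the REST API. The sketch below calls GET /collections with the standard library only; the helper names and the example payload (with a collection called "docs") are mine, and fetch_collections needs the container to be running on port 6333.

```python
import json
from urllib.request import urlopen


def collection_names(payload):
    """Extract collection names from a Qdrant `GET /collections` response."""
    return [item["name"] for item in payload["result"]["collections"]]


def fetch_collections(base_url="http://localhost:6333"):
    """Query a running Qdrant instance and return its collection names."""
    with urlopen(f"{base_url}/collections", timeout=5) as response:
        return collection_names(json.load(response))


# Hypothetical payload in the shape Qdrant returns, used here so the
# example works without a running container.
example = {
    "result": {"collections": [{"name": "docs"}]},
    "status": "ok",
    "time": 0.0,
}
print(collection_names(example))
```

A fresh container should simply report no collections.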
ChromaDB
docker pull chromadb/chroma
docker run -p 8001:8000 chromadb/chroma
version: '3.9'
services:
  chroma:
    container_name: chroma-container
    image: chromadb/chroma
    ports:
      - "8001:8000"
    volumes:
      - chroma_data:/chroma/chroma
volumes:
  chroma_data:
    driver: local
Then just go to http://localhost:8001/api/v1 to check the heartbeat.
If it responds, you are good to go with ChromaDB.
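The same heartbeat check can be scripted, which is handy before pointing another tool at Chroma. This is a minimal sketch assuming a Chroma version that still serves the v1 API used in this post, where the heartbeat response is a JSON object with a "nanosecond heartbeat" key; the function names are mine.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def is_alive(payload):
    """True if the payload looks like a Chroma heartbeat response."""
    return isinstance(payload.get("nanosecond heartbeat"), int)


def check_chroma(base_url="http://localhost:8001"):
    """Return True if Chroma answers the heartbeat endpoint, else False."""
    try:
        with urlopen(f"{base_url}/api/v1/heartbeat", timeout=5) as response:
            return is_alive(json.load(response))
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON answer.
        return False


# Offline demonstration with a made-up timestamp value.
print(is_alive({"nanosecond heartbeat": 1700000000000000000}))
```

Call check_chroma() once the container is up; it returns False instead of raising when nothing is listening on the port.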
FAQ
How to Build ChromaDB Container?
git clone git@github.com:chroma-core/chroma.git
cd chroma
docker-compose up -d --build #https://raw.githubusercontent.com/chroma-core/chroma/main/docker-compose.yml