This project scrapes content from multiple websites asynchronously, splits it into semantic chunks using NLTK sentence tokenization and Sentence Transformers embeddings, and stores the chunks in a Milvus vector database for efficient similarity search and retrieval. It is built with Python, LangChain, Sentence Transformers, and Milvus.
- Python 3.9
- aiohttp
- nltk
- pandas
- sentence-transformers
- pymilvus
- langchain
- scikit-learn
- numpy
- Asynchronous web scraping with aiohttp and langchain
- Semantic chunking using NLTK sentence tokenization
- Embedding with sentence-transformers/all-MiniLM-L6-v2
- Vector similarity search using Milvus
- Dockerized setup with Milvus, MinIO, Etcd, and Python environment
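The semantic chunking feature can be illustrated with a small sketch. This README does not show the project's actual grouping rule, so the version below is an assumption: it starts a new chunk whenever the cosine similarity between consecutive sentence embeddings falls below a threshold. To stay runnable without model downloads, `split_sentences` is a naive stand-in for `nltk.sent_tokenize` and `toy_embed` is a hypothetical bag-of-words stand-in for `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode`. The function names and the `threshold` value are illustrative only.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence

Vector = Sequence[float]


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_embed(sentence: str) -> List[float]:
    """Hypothetical embedding: word counts over a tiny fixed vocabulary.

    In the real pipeline this would be replaced by a SentenceTransformer
    model, which returns a 384-dimensional vector for all-MiniLM-L6-v2.
    """
    vocab = ["milvus", "vector", "search", "cat", "dog", "pet"]
    counts = Counter(re.findall(r"[a-z]+", sentence.lower()))
    return [float(counts[word]) for word in vocab]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    text: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.3,  # assumed cut-off, not taken from the project
) -> List[str]:
    """Group consecutive sentences while their embeddings stay similar."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) >= threshold:
            chunks[-1].append(sentence)  # same topic: extend current chunk
        else:
            chunks.append([sentence])    # topic shift: start a new chunk
    return [" ".join(chunk) for chunk in chunks]


text = (
    "Milvus stores vector data. Vector search in Milvus is fast. "
    "My cat is a pet. A dog is also a pet."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

With the toy embedding, the two Milvus sentences end up in one chunk and the two pet sentences in another. Swapping in a real sentence tokenizer and embedding model only requires passing a different `embed` callable and replacing `split_sentences`.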
1. Clone the Repository
git clone https://github.com/yourusername/semantic-web-milvus.git
cd semantic-web-milvus
2. Start Docker Services
docker-compose up --build -d
This will spin up:
- Milvus vector database
- Etcd (metadata service)
- MinIO (object storage)
- Python container (milvus-python) with all dependencies pre-installed
3. Access Python Container
docker exec -it milvus-python bash
4. Run Your Main Script
python your_script.py
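The contents of your_script.py are not shown in this README. As a rough sketch of what its scraping stage might look like, the example below fetches several URLs concurrently with `asyncio`, caps the number of in-flight requests with a semaphore, and records failures as `None` instead of aborting the whole run. The names `scrape_all`, `scrape_with_aiohttp`, and `fake_fetch`, the 30-second timeout, and the concurrency limit are all assumptions made for illustration. The langchain part of the scraping step is omitted.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

Fetcher = Callable[[str], Awaitable[str]]


async def scrape_all(
    urls: List[str],
    fetch: Fetcher,
    max_concurrency: int = 5,  # assumed limit, tune for the target sites
) -> Dict[str, Optional[str]]:
    """Fetch all URLs concurrently; a failed URL maps to None."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str):
        async with semaphore:  # at most max_concurrency fetches at once
            try:
                return url, await fetch(url)
            except Exception:
                return url, None

    results = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(results)


async def scrape_with_aiohttp(urls: List[str]) -> Dict[str, Optional[str]]:
    """Real-network variant: share one aiohttp session across all requests."""
    import aiohttp  # imported lazily so the offline demo below runs without it

    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:

        async def fetch(url: str) -> str:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()

        return await scrape_all(urls, fetch)


async def fake_fetch(url: str) -> str:
    """Offline stand-in for a network fetch, used only for the demo."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise RuntimeError("simulated failure")
    return f"<html>content of {url}</html>"


demo_urls = ["https://example.com/a", "https://example.com/broken"]
pages = asyncio.run(scrape_all(demo_urls, fake_fetch))
print(pages)
```

To scrape real pages, call `asyncio.run(scrape_with_aiohttp(urls))` inside the milvus-python container, where aiohttp is already installed.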
├── docker-compose.yml
├── Dockerfile.python
├── scripts/
│   └── your_script.py
└── volumes/
    ├── etcd/
    ├── milvus/
    └── minio/
- Building search engines over scraped web content
- Knowledge base construction with semantic search
- Content recommendation systems
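For the search-oriented use cases above, storing and querying chunks could be sketched with the `MilvusClient` interface from pymilvus. This is an assumption, not the project's confirmed code: the project may use a different pymilvus API, and the collection name `web_chunks`, the URI `http://localhost:19530`, and the helper names are hypothetical. The dimension of 384 matches the output size of all-MiniLM-L6-v2.

```python
from typing import List, Sequence

COLLECTION = "web_chunks"  # hypothetical collection name
DIMENSION = 384            # embedding size of all-MiniLM-L6-v2


def connect(uri: str = "http://localhost:19530"):
    """Open a client; the URI assumes Milvus is reachable on its default port."""
    from pymilvus import MilvusClient  # lazy import: only needed with a live server

    return MilvusClient(uri=uri)


def store_chunks(client, chunks: List[str], vectors: Sequence[Sequence[float]]) -> None:
    """Create the collection if needed and insert one row per chunk."""
    if not client.has_collection(collection_name=COLLECTION):
        # Quick-setup collection: integer "id" key plus a "vector" field;
        # extra keys such as "text" are kept as dynamic fields.
        client.create_collection(collection_name=COLLECTION, dimension=DIMENSION)
    rows = [
        {"id": i, "vector": list(vector), "text": chunk}
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]
    # Note: ids restart at 0 on every call; a real script needs unique ids.
    client.insert(collection_name=COLLECTION, data=rows)


def search_chunks(client, query_vector: Sequence[float], top_k: int = 3) -> List[str]:
    """Return the text of the top_k chunks most similar to query_vector."""
    hits = client.search(
        collection_name=COLLECTION,
        data=[list(query_vector)],
        limit=top_k,
        output_fields=["text"],
    )
    # One result list per query vector; only one query was sent.
    return [hit["entity"]["text"] for hit in hits[0]]
```

A typical flow would embed the chunks, call `store_chunks(connect(), chunks, vectors)`, then embed a user query with the same model and pass that vector to `search_chunks`.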
Author
LinkedIn: http://www.linkedin.com/in/SwapnilTaware
GitHub: https://github.com/itsSwapnil
This project is licensed under the MIT License.