A document analysis tool that uses Retrieval-Augmented Generation (RAG) to answer questions about your PDF documents.
Chaitanya Vankadaru
AI/ML Engineer | Python Developer | Data Scientist
LinkedIn Profile
- 📄 PDF Document Processing: Advanced PDF parsing and text extraction
- 🔍 Smart Text Chunking: Intelligent document segmentation with customizable settings
- 🧠 Vector Embeddings: Dense text embeddings generated with Sentence Transformers
- 💾 FAISS Vector Store: Fast and efficient similarity search
- 🤖 RAG Architecture: Enhanced question answering using document context
- 🎨 Modern UI: Clean, responsive interface with Streamlit
- 📊 System Statistics: Real-time performance metrics
- 🔄 Conversation History: Track and review Q&A interactions
- ⚙️ Customizable Settings: Adjust chunk size and overlap
- Python 3.12
- Streamlit (>=1.37.0)
- LangChain (>=0.2.5)
- FAISS-CPU (>=1.7.4)
- Sentence Transformers (>=2.2.2)
- OpenAI Python SDK for GPT models (>=1.6.1)
- PyPDF (>=3.17.0)
- Python 3.12 or higher
- OpenAI API key
- Git (for version control)
- Virtual environment (recommended)
- Clone the repository:
git clone https://github.com/EarthlyAlien/Document-Assistant.git
cd Document-Assistant
- Create and activate a virtual environment:
# On Windows
python -m venv venv
.\venv\Scripts\activate
# On macOS/Linux
python -m venv venv
source venv/bin/activate
- Install dependencies:
# For production
pip install -r requirements.txt
# For development
pip install -r requirements-dev.txt
- Set up environment variables:
Create a .env file in the project root:
OPENAI_API_KEY=your_api_key_here
- Run the application:
streamlit run app.py
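
The environment-variable step above can be sketched in code. This is a minimal, standard-library-only sketch and not the project's actual loader: the helper names `parse_env_text`, `load_settings`, and `require_api_key` are hypothetical, and the application may instead rely on a package such as python-dotenv.

```python
import os


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments (hypothetical helper)."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(path: str = ".env") -> dict[str, str]:
    """Merge the .env file with the process environment (the environment wins)."""
    file_values: dict[str, str] = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            file_values = parse_env_text(fh.read())
    return {**file_values, **os.environ}


def require_api_key(env: dict[str, str]) -> str:
    """Return OPENAI_API_KEY, failing early if it is missing or still the placeholder."""
    key = env.get("OPENAI_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is missing or still set to the placeholder")
    return key
```

Calling `require_api_key(load_settings())` at startup would surface a missing key before any document is processed, rather than on the first question.
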
- Document Upload
  - Use the sidebar to upload PDF documents
  - View uploaded document list
  - Clear documents when needed
- Configuration
  - Adjust chunk size (default: 1000)
  - Set chunk overlap (default: 200)
  - Configure these based on document length and complexity
- Processing
  - Click "Process Document" to extract text and generate embeddings
  - Monitor processing status in real-time
- Question Answering
  - Enter questions about your documents
  - View AI-generated responses with source context
  - Track conversation history
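
To make the chunk size and chunk overlap settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes both settings are measured in characters and uses a hypothetical `chunk_text` function. The project itself lists LangChain as a dependency and may use one of its text splitters, which also take separators into account, so treat this only as an illustration of how the two numbers interact.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. A larger overlap keeps more context across chunk boundaries at the cost of more chunks to embed.
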
The Document Assistant uses a RAG (Retrieval-Augmented Generation) architecture with three stages:
- Document Processing
  - PDF parsing and text extraction
  - Intelligent text chunking with overlap
  - Clean text preprocessing
- Vector Store
  - Chunk embedding generation using Sentence Transformers
  - FAISS vector index for efficient similarity search
  - Persistent storage of embeddings
- Question Answering
  - Query embedding and semantic search
  - Context retrieval from vector store
  - LLM-powered answer generation with context
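
The retrieval and answering stages above can be illustrated with a dependency-free sketch. To stay runnable without models or API keys, it swaps in a toy word-count embedding for Sentence Transformers and a brute-force cosine ranking for the FAISS index, and it stops at building the prompt rather than calling the OpenAI API. All function names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the prompt wording are hypothetical and not taken from the project's code.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a Sentence Transformers model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank every chunk against the question; a FAISS index would replace this scan."""
    query_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the text that would be sent to the LLM, with numbered context chunks."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

In a real deployment the embeddings would be dense float vectors computed once per chunk and stored in the index, so only the question is embedded at query time. The numbered context entries are one way to let the answer point back to its source chunks.
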
For development work:
- Install development dependencies:
pip install -r requirements-dev.txt
- Development tools available:
  - pytest (>=7.4.4): Testing framework
  - pytest-cov (>=4.1.0): Code coverage
  - flake8 (>=7.0.0): Code linting
  - mypy (>=1.8.0): Static type checking
  - black (>=24.2.0): Code formatting
- Run tests:
# Run all tests
pytest
# Run with coverage report
pytest --cov=.
# Run with verbose output
pytest -v
- Code formatting:
# Format code
black .
# Check code style
flake8 .
# Type checking
mypy .
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
- Regular dependency updates
- Security vulnerability monitoring
- Safe API key handling
- Input validation and sanitization
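
As an illustration of the last two points, the sketch below shows one possible way to validate an uploaded file and to mask an API key before it is logged or displayed. The function names, the 20 MB limit, and the strict header check are assumptions made for this example, not the project's actual implementation.

```python
PDF_MAGIC = b"%PDF-"  # PDF files begin with this header
MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # assumed limit for illustration only


def validate_pdf_upload(filename: str, data: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    """Raise ValueError if the upload does not look like an acceptable PDF."""
    if not filename.lower().endswith(".pdf"):
        raise ValueError("only .pdf files are accepted")
    if len(data) == 0:
        raise ValueError("file is empty")
    if len(data) > max_bytes:
        raise ValueError("file exceeds the size limit")
    # Strict check: some PDF readers tolerate leading bytes before the header.
    if not data.startswith(PDF_MAGIC):
        raise ValueError("file does not start with the PDF header")


def mask_secret(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters of a secret for logs or UI output."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Checking the file content as well as the extension guards against renamed files reaching the PDF parser, and masking keeps the full key out of logs and screenshots.
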
This project is licensed under the MIT License - see the LICENSE file for details.
- Author: Chaitanya Vankadaru
- LinkedIn: Profile
- GitHub: EarthlyAlien
- OpenAI for GPT API
- Streamlit for the UI framework
- FAISS for vector similarity search
- Sentence Transformers for embeddings