Build Vector DB and RAG(Will be updated)

Building a Vector Database of Research Papers from SMILES Strings

This idea originated while I was working with a dataset from the UCI Machine Learning Repository: Drug-Induced Autoimmunity Prediction Dataset

The goal of this project is to build a custom knowledge base from medical literature for any given molecule, using its SMILES (Simplified Molecular Input Line Entry System) representation. Here's how the process works:

Search and Download Research Papers Given a SMILES string (or a related keyword), the program automatically searches for and downloads all freely available research papers related to that molecule or term. These PDFs are saved locally in a directory called Research_papers.
Convert Papers to Embeddings The downloaded PDFs are processed and converted into text embeddings. These embeddings are then uploaded to a Pinecone vector database, enabling efficient similarity-based search and retrieval.
Clean-Up Once the embeddings are stored in Pinecone, the local PDF files are deleted to save disk space. This ensures the system doesn't become overloaded with files.
Querying the Database (RAG System) The final component allows users to query the Pinecone index. The system retrieves relevant information and provides citations from the indexed papers. If a query doesn't match any data in the index, it will simply inform you that there’s no relevant information available.

This is a basic Retrieval-Augmented Generation (RAG) setup. While functional, it’s not highly optimized—it can be slow. For larger-scale applications, the code would need to be parallelized for better performance.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
research_papers		research_papers
.DS_Store		.DS_Store
.gitignore		.gitignore
DIA_trainingset_RDKit_descriptors.csv		DIA_trainingset_RDKit_descriptors.csv
LICENSE		LICENSE
README.md		README.md
cleaned_notebook.ipynb		cleaned_notebook.ipynb
demo_script.py		demo_script.py
down_pdf.sh		down_pdf.sh
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Build Vector DB and RAG(Will be updated)

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

btarun13/Search_Vector_DB_RAG

Folders and files

Latest commit

History

Repository files navigation

Build Vector DB and RAG(Will be updated)

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages