ABnano/SciQu

This repository contains the scripts for building an LLM-driven scientific data extraction tool.

SciQu - Automated Data Mining and ML Training Integration

Project Overview

SciQu is a tool designed to streamline the literature review process by automating data extraction and query handling from PDF files. Built with Streamlit for the user interface and LangChain components for backend processing, SciQu enables efficient and accurate retrieval of information from scientific documents.

(Figure: Sci_Qu)

Key Features

  1. PDF Upload and Processing: Users can upload PDF files, which are processed using the UnstructuredPDFLoader to extract text.
  2. Text Chunking: The extracted text is split into manageable chunks using RecursiveCharacterTextSplitter, with a chunk size of 700 characters and an overlap of 100 characters.
  3. Embedding and Storage: Chunks are embedded using OllamaEmbeddings and stored in a Chroma vector database.
  4. Dynamic Query Handling: Users can query the contents of the uploaded documents through a text input field.
  5. Multi-Perspective Retrieval: Queries are processed using a MultiQueryRetriever, generating multiple perspectives to enhance retrieval accuracy.
  6. Contextual Response Generation: Retrieved contexts are passed to a ChatOllama model to generate responses, which are displayed to the user.
  7. Session History Tracking: Query-answer pairs are saved in the session state for history tracking.
  8. ML Training Integration: Demonstrates the use of machine learning for predicting material properties.
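The chunking behavior in feature 2 (700-character chunks with a 100-character overlap) can be illustrated with a minimal sketch in plain Python. `chunk_text` is a hypothetical helper written for illustration only; LangChain's actual RecursiveCharacterTextSplitter additionally tries to split on natural separators such as paragraph and sentence boundaries:

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, so content cut at a chunk
    boundary still appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance 600 characters per chunk
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1500)
```

With a 1,500-character input this yields chunks starting at offsets 0, 600, and 1200, so each consecutive pair shares 100 characters.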

Project Structure

1. SciQu for Automated Data Mining

Steps:

  1. Upload PDF Files: Users can upload PDF files through the Streamlit interface.
  2. Text Extraction: Uploaded PDFs are processed using UnstructuredPDFLoader to extract text.
  3. Text Chunking: The extracted text is split into chunks of 700 characters with a 100-character overlap using RecursiveCharacterTextSplitter.
  4. Embedding: The text chunks are embedded using OllamaEmbeddings.
  5. Storage: Embedded chunks are stored in a Chroma vector database.
  6. Query Input: Users input queries through a text field.
  7. MultiQuery Retrieval: Queries are processed using MultiQueryRetriever to generate multiple perspectives.
  8. Response Generation: Contexts retrieved are passed to a ChatOllama model to generate responses.
  9. Session State: Query-answer pairs are stored for session history tracking.

2. Integration of ML Training with SciQu

Dataset:

  • Materials: Properties of 20 materials serve as input descriptors for predicting the refractive index. The materials are K2Te, K2O, BaS, Na2Te, SnSe, CaS, MgS, CdI2, CdBr2, YN, HgF2, SnO, BN, PtO2, K2S, BeS, MgI2, RbBr, VCl2, and Na2S.

Steps:

  1. Library Installation: Install necessary libraries.
  2. Dataset Loading: Load the dataset containing materials and their properties.
  3. Attribute Extraction: Extract selected attributes, including refractive index, band gap, ferroelectricity, etc.
  4. Data Preprocessing: Check the dataset for any missing values.
  5. Feature Selection: Define input features (X) and the target variable (y), selecting relevant columns.
  6. Data Splitting: Split the data into training and testing sets (70-30 split).
  7. Model Training: Create and train a Random Forest Regressor model with 100 estimators on the training data.
  8. Model Evaluation: Make predictions on the test set and evaluate the model's performance using RMSE and R-squared score.
  9. Visualization: Generate regression and residual plots using Seaborn to visualize model performance.
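Steps 5 through 8 above can be sketched with scikit-learn. The synthetic `X` and `y` below are placeholders for the real descriptor columns (band gap, ferroelectricity, etc.) and refractive-index targets, which are not reproduced in this README:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.random((20, 3))                             # 20 materials x 3 placeholder descriptors
y = 1.5 + 2.0 * X[:, 0] + rng.normal(0, 0.05, 20)   # stand-in refractive-index target

# 70-30 train/test split, as described in step 6
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest Regressor with 100 estimators, as in step 7
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 8: evaluate with RMSE and R-squared on the held-out test set
pred = model.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
r2 = r2_score(y_test, pred)
```

The regression and residual plots of step 9 would then be drawn from `y_test` and `pred` (e.g. with Seaborn's `regplot` and `residplot`).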

Installation

To set up and run the SciQu tool, follow these steps:

  1. Clone the repository:

    git clone https://github.com/yourusername/sciqu.git
    cd sciqu
  2. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install the required libraries:

    pip install -r requirements.txt
  4. Run the application:

    streamlit run app.py
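Based on the components named in this README, requirements.txt would plausibly include packages such as the following (assumed names, not a copy of the repository's actual file):

```
streamlit
langchain
langchain-community
chromadb
unstructured
scikit-learn
seaborn
```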

Usage

  1. Upload a PDF: Use the file uploader to select a PDF document.
  2. Query the Document: Enter your query in the text input field and submit.
  3. View Responses: The response generated by the ChatOllama model will be displayed, and the query-answer pairs will be saved in the session history.
  4. ML Training: Follow the provided steps to train the ML model using the sample dataset.

Contributions

Contributions are welcome! Please submit a pull request or open an issue to discuss any changes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Special thanks to Prof. Dipankar Mandal for helpful discussions.

