
Document Topic Modeling Framework

Python Transformers Flask

A comprehensive framework for unsupervised topic modeling and document classification using transformer-based embeddings. This project includes both the research notebooks for model development and a complete interactive web application.

Project Overview

This project combines modern natural language processing techniques to discover latent topics in text documents without manual labeling:

  • Transformer Embeddings: Leverage RoBERTa to generate high-quality document representations
  • Dimensionality Reduction: Apply UMAP to preserve semantic relationships while reducing dimensions
  • Clustering: Implement K-means clustering to discover natural topic groups
  • Interactive Visualization: Explore document-topic relationships through interactive visualizations

The implementation is provided both as research notebooks and as a production-ready web application.
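For orientation, the core pipeline can be sketched in a few lines. This is a minimal illustration rather than the exact notebook configuration: the sentence-level RoBERTa encoder, the UMAP parameters, and the cluster count below are assumptions.

```python
# Minimal sketch: transformer embeddings -> UMAP -> K-means.
# Model name and hyperparameters are illustrative, not the notebooks' exact setup.
import umap
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Replace with the real corpus (e.g. the 2,225 BBC News articles).
documents = ["Oil prices fell sharply on Tuesday...", "The midfielder scored twice..."]

# 1. Transformer embeddings (a RoBERTa-based sentence encoder is assumed here)
encoder = SentenceTransformer("all-distilroberta-v1")
embeddings = encoder.encode(documents)              # shape: (n_docs, 768)

# 2. UMAP reduction that preserves local semantic structure
reducer = umap.UMAP(n_components=5, n_neighbors=15, metric="cosine", random_state=42)
reduced = reducer.fit_transform(embeddings)

# 3. K-means clustering to discover topic groups (5 categories in BBC News)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
topic_ids = kmeans.fit_predict(reduced)
```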

Repository Structure

├── Notebooks/
│   ├── Advanced_Topic_Modeling.ipynb
│   └── Transformers_Embedding_Benchmark.ipynb
├── application/
│   ├── static/
│   ├── templates/
│   ├── topic_model_export/
│   ├── app.py
│   └── requirements.txt
└── README.md

Research Notebooks

Advanced_Topic_Modeling.ipynb

This main notebook implements a comprehensive framework for unsupervised topic modeling using transformer-based embeddings, tested on the BBC News Dataset (2,225 documents across 5 categories):

  1. Data Understanding & Preprocessing: Thorough exploration of the BBC News dataset including category distribution, document length analysis, and word frequency analysis.
  2. Embedding Pipeline: Implementation of RoBERTa embeddings for document representation, with methods to handle document length constraints.
  3. Dimensionality Reduction: Comparison of PCA, t-SNE, and UMAP techniques, with UMAP providing the best performance.
  4. Clustering Implementation: K-means and hierarchical clustering algorithms with evaluation metrics such as ARI, homogeneity, and silhouette scores (a minimal evaluation sketch follows this list).
  5. Topic Extraction & Labeling: Methods to automatically extract distinctive terms and generate meaningful topic labels for each cluster.
  6. Visualization Tools: Creation of various visualizations including topic proximity maps, document-topic distributions, and interactive visualizations.
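The metrics named in step 4 can be computed with scikit-learn. The sketch below assumes `labels_true` holds the ground-truth BBC News categories and that `reduced` and `topic_ids` come from a pipeline like the one sketched in the Project Overview.

```python
# Clustering evaluation sketch; `labels_true`, `reduced`, and `topic_ids`
# are assumed to come from the pipeline sketched earlier.
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             silhouette_score, v_measure_score)

ari = adjusted_rand_score(labels_true, topic_ids)         # agreement with ground truth
homogeneity = homogeneity_score(labels_true, topic_ids)   # each cluster maps to one category
v_measure = v_measure_score(labels_true, topic_ids)       # balance of homogeneity and completeness
silhouette = silhouette_score(reduced, topic_ids)         # cohesion vs. separation, label-free

print(f"ARI={ari:.3f}  homogeneity={homogeneity:.3f}  "
      f"V-measure={v_measure:.3f}  silhouette={silhouette:.3f}")
```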

Transformers_Embedding_Benchmark.ipynb

This notebook focuses on benchmarking different transformer models and embedding techniques to identify the optimal configuration:

  1. Model Comparison: Evaluation of 5 transformer models (BERT, RoBERTa, DistilBERT, MPNet, and ALBERT) on the same BBC News dataset.
  2. Embedding Method Assessment: Comparison of two embedding approaches - CLS token (using the classification token) and Mean token (averaging all token embeddings); both are illustrated in the sketch after this list.
  3. Performance Metrics: Thorough evaluation using clustering quality metrics (ARI, Homogeneity, V-measure, Silhouette score) and processing efficiency.
  4. Results Analysis: Detailed visualizations and comparisons of model performance, with recommendations for the optimal configuration.
  5. Recommendations: Clear conclusions about which model and embedding method perform best for topic modeling tasks.
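The two embedding strategies compared in point 2 can be sketched as follows. This uses `roberta-base` with Hugging Face Transformers as an illustration, not the benchmark notebook's exact code.

```python
# CLS-token vs. mean-token embeddings, sketched with Hugging Face Transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

def embed(texts, method="mean"):
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (batch, tokens, 768)
    if method == "cls":
        return hidden[:, 0, :]                            # first (<s>/CLS) token
    mask = inputs["attention_mask"].unsqueeze(-1)         # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean over real tokens

cls_emb = embed(["Shares in the tech firm rose 5%."], method="cls")
mean_emb = embed(["Shares in the tech firm rose 5%."], method="mean")
```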

Web Application

The /application folder contains a production-ready Flask web application that implements the topic modeling framework with an intuitive user interface.

Application Overview

This application uses advanced natural language processing techniques to automatically classify text documents into topics without requiring any manual labeling. The system combines several state-of-the-art machine learning components:

Technical Architecture

  • RoBERTa Embeddings: A state-of-the-art transformer model that converts text into high-dimensional semantic representations (768 dimensions)
  • UMAP Dimensionality Reduction: Reduces the high-dimensional embeddings to a manageable space while preserving important semantic relationships
  • K-means Clustering: Groups documents with similar meanings into distinct topics

Classification Process

  1. Document Embedding: When you input a document, the system generates a RoBERTa embedding that captures its semantic meaning
  2. Dimensional Projection: This embedding is then projected into the same space as pre-trained topic clusters using UMAP
  3. Proximity Analysis: The system calculates the proximity to each topic cluster and assigns confidence scores
  4. Topic Assignment: The document is classified into the most likely topic, with confidence scores shown for all categories
  5. Key Term Extraction: Distinctive terms from the document are identified and displayed to explain classification reasoning
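As a rough sketch of this flow (not the app's actual code), a new document could be scored against the fitted encoder, UMAP reducer, and K-means model as below; the function name, arguments, and distance-based confidence formula are illustrative assumptions.

```python
import numpy as np

def classify(text, encoder, reducer, kmeans, topic_labels):
    """Illustrative classification of one document against fitted models."""
    embedding = encoder.encode([text])                     # 1. RoBERTa embedding
    point = reducer.transform(embedding)                   # 2. project into the topic space
    distances = np.linalg.norm(point - kmeans.cluster_centers_, axis=1)  # 3. proximity per cluster
    confidences = np.exp(-distances) / np.exp(-distances).sum()          # softmax-style scores
    best = int(np.argmin(distances))                       # 4. most likely topic
    return topic_labels[best], dict(zip(topic_labels, confidences.round(3)))
```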

Features

  • Text Input Options: Paste text directly or select from example documents
  • Real-time Classification: Process documents instantly and see results without page refresh
  • Confidence Analysis: View probability distribution across all potential topics
  • Document-specific Key Terms: Extract and highlight distinctive terms from the input document
  • Interactive Visualization: Explore how the document is positioned relative to the topic clusters in 2D space
  • Responsive Design: Works seamlessly on desktop and mobile devices

Screenshots

[Main page screenshot] [Document example screenshot]

Getting Started

Prerequisites

  • Python 3.7+
  • Required libraries listed in requirements.txt

Installation

  1. Clone the repository:
     git clone https://github.com/imanerh/Topic-Modeling.git
     cd Topic-Modeling
  2. Install required packages:
     pip install -r requirements.txt

Running the Web Application

  1. Navigate to the application directory:
     cd application
  2. Run the Flask app:
     python app.py
  3. Open your browser and go to:
     http://localhost:5000

Results

The framework achieves excellent topic separation on the BBC News dataset:

  • High Clustering Metrics: 0.85+ homogeneity and V-measure scores
  • Strong Category Alignment: 90%+ alignment with ground truth categories
  • Interpretable Topics: Clear and distinctive terms characterizing each topic
