Categorizing Trends in Science Case Study

This repository contains code and resources for a case study on categorizing trends in science by clustering scholarly articles based on their content. The case study aims to provide insights into the organization and categorization of scholarly articles across diverse fields of study.

Overview

Scholarly articles serve as the backbone of academic research, disseminating knowledge and advancements in various disciplines. However, navigating through a vast array of articles spanning different subjects can be daunting. In this case study, we delve into the world of scholarly articles to uncover patterns and themes hidden within the text data.

Key Objectives

  1. Understanding Scholarly Content: Gain insights into the content and themes present in scholarly articles across different fields.

  2. Organizing Knowledge: Develop a method for clustering articles based on their content to facilitate organization and retrieval.

  3. Exploring Interdisciplinary Connections: Identify interdisciplinary connections and relationships between different fields of study through clustering analysis.

Dataset

The dataset used in this case study comprises scholarly articles collected from various sources, covering a wide range of topics across different fields of study. Each article is represented by its title, abstract, and category. With more than 2.4 million rows, the dataset provides a rich source of information for analysis and clustering. The dataset is available on Kaggle.
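As a rough sketch of how such a dataset could be read without loading every row into memory at once, the snippet below streams records one line at a time. It assumes the data is stored as JSON Lines and that the fields are named `title`, `abstract`, and `categories`. Both the file layout and the field names are assumptions for illustration, and `iter_articles` is a hypothetical helper, not part of this repository.

```python
import json
from typing import Iterator


def iter_articles(path: str, fields=("title", "abstract", "categories")) -> Iterator[dict]:
    """Yield one dict per article, keeping only the requested fields.

    Assumes one JSON object per line (JSON Lines); field names are hypothetical.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # Skip blank lines rather than failing on them.
                continue
            record = json.loads(line)
            # Missing fields come back as None instead of raising KeyError.
            yield {name: record.get(name) for name in fields}
```

Because the function is a generator, rows can be filtered or sampled as they are read, which helps when the full file does not fit comfortably in memory.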

Methodology

The case study follows a structured approach consisting of several key steps:

  1. Data Exploration: We begin by exploring the dataset, analyzing its dimensions, handling missing values, and visualizing the distribution of article categories.

  2. Data Preprocessing: Text preprocessing techniques are applied to clean and prepare the text for analysis. This includes removing punctuation and stopwords, and lemmatizing words to standardize the text.

  3. TF-IDF Transformation: Text data is transformed into numerical vectors using the TF-IDF (Term Frequency-Inverse Document Frequency) representation, allowing us to quantify the importance of words in each document.

  4. Clustering Optimization: We explore different methods, such as the Elbow Method and Silhouette Score, to determine the optimal number of clusters for the dataset.

  5. Clustering: Clustering algorithms such as KMeans and MiniBatchKMeans are implemented to group similar articles together based on their content. We evaluate clustering performance using metrics like the Davies-Bouldin Index.

  6. Dimensionality Reduction and Visualization: Dimensionality reduction techniques, such as TruncatedSVD, are employed to visualize the clustering results in lower-dimensional space. This allows us to gain insights into the distribution of articles across clusters.
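Steps 3 to 6 above can be sketched end to end with scikit-learn. This is a minimal illustration, not the notebooks' actual code: the function name `cluster_texts`, its parameters, and the choice to compute the score on the reduced coordinates are assumptions made for the example, and lemmatization is left out.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import davies_bouldin_score


def cluster_texts(texts, n_clusters, random_state=0):
    """Hypothetical helper: TF-IDF -> KMeans -> 2-D projection and score."""
    # Turn raw text into a sparse TF-IDF matrix, dropping English stopwords.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    tfidf = vectorizer.fit_transform(texts)

    # Group documents by content; KMeans works directly on the sparse matrix.
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(tfidf)

    # Project to two dimensions so the clusters can be plotted.
    svd = TruncatedSVD(n_components=2, random_state=random_state)
    coords = svd.fit_transform(tfidf)

    # Lower Davies-Bouldin values indicate better-separated clusters.
    # Assumption: the score is computed on the dense 2-D coordinates.
    score = davies_bouldin_score(coords, labels)
    return labels, coords, score
```

Calling `cluster_texts(abstracts, n_clusters=13)` on a list of abstract strings would return one label per document, the 2-D coordinates for plotting, and the score. On a corpus of millions of rows, `MiniBatchKMeans` takes the same `n_clusters` and `random_state` arguments and can stand in for `KMeans`.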

Usage

To replicate the case study and explore the clustering of scholarly articles, follow these steps:

  1. Clone the repository:

    git clone https://github.com/TouradBaba/Trends_in_Science.git
  2. Install dependencies:

    pip install -r requirements.txt
  3. Run the Jupyter notebooks in order.

Results and Insights

  1. Optimal Number of Clusters: Through the Elbow Method and Silhouette Score, we determined that the dataset is best represented by thirteen clusters. This number was chosen based on the optimal balance between intra-cluster cohesion and inter-cluster separation.

  2. Clustering Performance: We evaluated the performance of two clustering algorithms, KMeans and MiniBatchKMeans, using the Davies-Bouldin Index. While both algorithms produced reasonable clusters, KMeans slightly outperformed MiniBatchKMeans, as indicated by its lower Davies-Bouldin Index scores.

  3. Homogeneous Clusters: Each of the thirteen clusters identified in our analysis exhibited homogeneity in terms of the categories of articles they contained. For example, Cluster 0 predominantly consisted of articles related to mathematics, while Cluster 1 focused on topics in astrophysics and computer science. This homogeneity within clusters underscores the effectiveness of the clustering approach in grouping similar articles together.

  4. Dimensionality Reduction and Visualization: We employed TruncatedSVD to reduce the dimensionality of the dataset, allowing us to visualize the clustering results in two and three dimensions. These visualizations provided valuable insights into the distribution of data points and the separation of clusters in reduced dimensional space.

  5. Interpretation of Clusters: The clusters generated by the analysis represent distinct themes or topics within the dataset. By examining the articles contained in each cluster, we gained a deeper understanding of the underlying structure and content of the scholarly articles, enabling us to discern patterns and relationships between different fields of study.

  6. Creation of Auxiliary Resource: Additionally, we created a JSON file mapping category aliases to their corresponding names, facilitating better understanding and interpretation of the categories within the dataset.
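The cluster-count selection described in item 1 can be illustrated with a short sketch. It fits KMeans for each candidate count and records the inertia (the quantity plotted in the Elbow Method) alongside the silhouette score. The helper names `score_cluster_counts` and `best_by_silhouette` are hypothetical, and the notebooks may combine the two criteria differently.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def score_cluster_counts(features, candidates, random_state=0):
    """Hypothetical helper: fit KMeans for each k and collect both criteria."""
    results = {}
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        results[k] = {
            # Sum of squared distances to the nearest centroid (elbow curve).
            "inertia": float(model.inertia_),
            # Mean silhouette over all samples; higher is better.
            "silhouette": float(silhouette_score(features, labels)),
        }
    return results


def best_by_silhouette(results):
    """Return the candidate k with the highest silhouette score."""
    return max(results, key=lambda k: results[k]["silhouette"])
```

Inertia always decreases as the number of clusters grows, so it is read by looking for the bend in the curve, while the silhouette score can be compared directly across candidates. On a dataset of this size, the `sample_size` argument of `silhouette_score` can be used to score a random subset instead of every row.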

In conclusion, the analysis has shed light on the structure and content of the dataset, making it easier to identify and explore distinct themes and topics. The clustering results provide a framework for organizing and categorizing scholarly articles, contributing to a broader understanding of academic research across diverse fields. Moving forward, these insights can be applied to recommendation systems, topic modeling, and academic research management.