This repository contains the implementation and results of my undergraduate thesis, titled "Klasifikasi Abstrak Artikel Ilmiah pada Dataset Cora Menggunakan Long Short Term Memory" (Classification of Scientific Article Abstracts in the Cora Dataset Using Long Short-Term Memory). The research classifies scientific articles by their abstracts using a Long Short-Term Memory (LSTM) model with FastText word embeddings.
In this study, text classification was conducted on the Cora dataset using the LSTM model. The dataset contains scientific article abstracts grouped into seven topics:
- Neural Networks
- Probabilistic Methods
- Genetic Algorithms
- Theory
- Case Based
- Reinforcement Learning
- Rule Learning
By utilizing text mining techniques, this research focuses on improving classification accuracy using FastText embeddings and optimizing hyperparameters.
The Cora dataset was used, which consists of 2,708 scientific articles, each represented by its abstract and an associated topic label. To prepare the data, pre-processing steps were applied: cleaning, stemming, stopword removal, and tokenization.
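The pre-processing steps can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the thesis code: the stopword list here is a tiny hypothetical stand-in for a full one, and stemming (typically done with an NLP library's stemmer) is omitted for brevity.

```python
import re

# Hypothetical, tiny stopword list -- a real pipeline would use a full
# list from an NLP library; shown here only to illustrate the step.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

def preprocess(abstract: str) -> list[str]:
    """Clean, case-fold, tokenize, and remove stopwords from one abstract."""
    text = abstract.lower()                  # case folding
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation and digits
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Neural Networks: a survey of learning, in 7 parts."))
# -> ['neural', 'networks', 'survey', 'learning', 'parts']
```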
- Number of articles: 2,708
- Number of unique words: 16,155 (reduced to 1,433 after filtering)
- Seven predefined classes/topics
- Data Collection: combining the content file and the abstract data into a unified dataset.
- Pre-processing:
  - Cleaning the text by removing punctuation and applying case folding.
  - Applying stemming and stopword removal.
  - Filtering unique words to match the dataset specification.
- Feature Engineering:
  - Generating word embeddings with FastText.
- Data Splitting:
  - Splitting the data using K-fold cross-validation with k = 5.
- Model Development:
  - Constructing the LSTM architecture with optimized hyperparameters.
- Evaluation:
  - Measuring performance with accuracy, precision, recall, F1-score, and a confusion matrix.
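A note on the feature-engineering step above: FastText differs from plain word2vec in that each word vector is the sum of vectors for the word's character n-grams, which helps with rare and out-of-vocabulary terms. The subword extraction at its core can be sketched as follows (the 3-to-6 n-gram range is FastText's default; the actual embedding training, done with a library such as gensim, is omitted):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams of a word, with FastText-style boundary markers."""
    marked = f"<{word}>"          # '<' and '>' mark the word boundaries
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("cat"))
# -> ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```

Because "learning" and "learner" share many of these subwords, their vectors end up related even if one of them is rare in the corpus.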
- The best model achieved 93.55% accuracy with a loss of 0.3108.
- Optimal hyperparameters:
  - Vector size: 8
  - Batch size: 32
  - Epochs: 50
  - Unique words: 11,147
This research achieved higher accuracy than prior methods:
- Graph Convolutional Networks (GCN): 81.5%
- Graph Attention Networks (GAT): 83%
- SplineCNN: 89.48%
- Graph Convolutional Networks with Kronecker-Factored Approximate Curvature (GCN Adam-KFAC): 90.16%
The confusion matrix analysis showed high precision and recall for all seven classes, with most scores exceeding 97%.
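Per-class precision and recall can be read directly off a confusion matrix: precision for class k divides the diagonal entry by its column sum (all predictions of k), recall divides it by its row sum (all true members of k). A small stdlib sketch (the 2-class matrix below is made-up illustrative data, not the thesis's results):

```python
def precision_recall(cm: list[list[int]]) -> list[tuple[float, float]]:
    """Per-class (precision, recall) from a confusion matrix.

    cm[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        scores.append((precision, recall))
    return scores

# Made-up 2-class example: class 0 precision 8/9, recall 8/10, etc.
print(precision_recall([[8, 2],
                        [1, 9]]))
```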
The research demonstrates that LSTM combined with FastText embeddings effectively classifies scientific article abstracts from the Cora dataset. Key findings include:
- The use of stemming and filtering unique words significantly enhances accuracy.
- Larger input sizes generally improve accuracy but increase training time.
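For reference, the LSTM cell at the core of the model combines a persistent cell state with input, forget, and output gates. A single scalar time step with hypothetical hand-picked weights looks like this (a math-only sketch of the standard LSTM equations, not the thesis code, which would use a deep-learning framework):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w, u, b):
    """One scalar LSTM step. w, u, b hold per-gate parameters
    (keys: 'i' input, 'f' forget, 'o' output, 'g' candidate)."""
    i = sigmoid(w["i"] * x + u["i"] * h_prev + b["i"])    # input gate
    f = sigmoid(w["f"] * x + u["f"] * h_prev + b["f"])    # forget gate
    o = sigmoid(w["o"] * x + u["o"] * h_prev + b["o"])    # output gate
    g = math.tanh(w["g"] * x + u["g"] * h_prev + b["g"])  # candidate value
    c = f * c_prev + i * g      # new cell state: keep some old, add some new
    h = o * math.tanh(c)        # new hidden state
    return h, c

# Hypothetical unit weights and zero biases, starting from a blank state
ones = {k: 1.0 for k in "ifog"}
zeros = {k: 0.0 for k in "ifog"}
h, c = lstm_step(0.5, 0.0, 0.0, ones, ones, zeros)
print(h, c)
```

The forget gate `f` is what lets the cell state carry information across many tokens of an abstract, which is the property that motivates LSTM for this task.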
If you use this repository or dataset, please cite:
Daffa Fikri. 2025. Klasifikasi Abstrak Artikel Ilmiah pada Dataset Cora Menggunakan Long Short Term Memory. Institut Pertanian Bogor.