This repository contains the implementation and results of my undergraduate thesis, titled "Klasifikasi Abstrak Artikel Ilmiah pada Dataset Cora Menggunakan Long Short Term Memory" (Classification of Scientific Article Abstracts in the Cora Dataset Using Long Short-Term Memory). The research classifies scientific articles by their abstracts using a Long Short-Term Memory (LSTM) model with FastText word embeddings.
In this study, text classification was conducted on the Cora dataset using the LSTM model. The dataset contains scientific article abstracts grouped into seven topics:
- Neural Networks
- Probabilistic Methods
- Genetic Algorithms
- Theory
- Case Based
- Reinforcement Learning
- Rule Learning
By utilizing text mining techniques, this research focuses on improving classification accuracy using FastText embeddings and optimizing hyperparameters.
The Cora dataset was used, which consists of 2,708 scientific articles, each represented by its abstract and an associated topic label. To prepare the data, pre-processing steps were applied: cleaning, stemming, stopword removal, and tokenization.
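The pre-processing steps can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the thesis code: the stopword list here is a tiny hypothetical stand-in for a full one, and stemming (typically done with an NLP library's stemmer) is omitted for brevity.

```python
import re

# Hypothetical, tiny stopword list -- a real pipeline would use a full
# list from an NLP library; shown here only to illustrate the step.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

def preprocess(abstract: str) -> list[str]:
    """Clean, case-fold, tokenize, and remove stopwords from one abstract."""
    text = abstract.lower()                  # case folding
    text = re.sub(r"[^a-z\s]", " ", text)    # remove punctuation and digits
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Neural Networks: a survey of learning, in 7 parts."))
# -> ['neural', 'networks', 'survey', 'learning', 'parts']
```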
- Number of articles: 2,708
- Number of unique words: 16,155 (reduced to 1,433 after filtering)
- Seven predefined classes/topics
- Data Collection: combining the content file and the abstract data into a unified dataset.
- Pre-processing:
  - Cleaning the text by removing punctuation and applying case folding.
  - Applying stemming and stopword removal.
  - Filtering unique words to match the dataset specification.
- Feature Engineering:
  - Generating word embeddings with FastText.
- Data Splitting:
  - Splitting the data using K-fold cross-validation with k = 5.
- Model Development:
  - Constructing the LSTM architecture with optimized hyperparameters.
- Evaluation:
  - Measuring performance with accuracy, precision, recall, F1-score, and a confusion matrix.
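A note on the feature-engineering step above: FastText differs from plain word2vec in that each word vector is the sum of vectors for the word's character n-grams, which helps with rare and out-of-vocabulary terms. The subword extraction at its core can be sketched as follows (the 3-to-6 n-gram range is FastText's default; the actual embedding training, done with a library such as gensim, is omitted):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams of a word, with FastText-style boundary markers."""
    marked = f"<{word}>"          # '<' and '>' mark the word boundaries
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("cat"))
# -> ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```

Because "learning" and "learner" share many of these subwords, their vectors end up related even if one of them is rare in the corpus.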
- The best model achieved 93.55% accuracy with a loss of 0.3108.
- Optimal hyperparameters:
  - Vector size: 8
  - Batch size: 32
  - Epochs: 50
  - Unique words: 11,147
This research achieved higher accuracy than prior methods:
- Graph Convolutional Networks (GCN): 81.5%
- Graph Attention Networks (GAT): 83%
- SplineCNN: 89.48%
- Graph Convolutional Networks with Kronecker-Factored Approximate Curvature (GCN Adam-KFAC): 90.16%
The confusion matrix analysis showed high precision and recall for all seven classes, with most scores exceeding 97%.
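Per-class precision and recall can be read directly off a confusion matrix: precision for class k divides the diagonal entry by its column sum (all predictions of k), recall divides it by its row sum (all true members of k). A small stdlib sketch (the 2-class matrix below is made-up illustrative data, not the thesis's results):

```python
def precision_recall(cm: list[list[int]]) -> list[tuple[float, float]]:
    """Per-class (precision, recall) from a confusion matrix.

    cm[i][j] = number of samples with true class i predicted as class j.
    """
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        scores.append((precision, recall))
    return scores

# Made-up 2-class example: class 0 precision 8/9, recall 8/10, etc.
print(precision_recall([[8, 2],
                        [1, 9]]))
```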
The research demonstrates that LSTM combined with FastText embeddings effectively classifies scientific article abstracts from the Cora dataset. Key findings include:
- The use of stemming and filtering unique words significantly enhances accuracy.
- Larger input sizes generally improve accuracy but increase training time.
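For reference, the LSTM cell at the core of the model combines a persistent cell state with input, forget, and output gates. A single scalar time step with hypothetical hand-picked weights looks like this (a math-only sketch of the standard LSTM equations, not the thesis code, which would use a deep-learning framework):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w, u, b):
    """One scalar LSTM step. w, u, b hold per-gate parameters
    (keys: 'i' input, 'f' forget, 'o' output, 'g' candidate)."""
    i = sigmoid(w["i"] * x + u["i"] * h_prev + b["i"])    # input gate
    f = sigmoid(w["f"] * x + u["f"] * h_prev + b["f"])    # forget gate
    o = sigmoid(w["o"] * x + u["o"] * h_prev + b["o"])    # output gate
    g = math.tanh(w["g"] * x + u["g"] * h_prev + b["g"])  # candidate value
    c = f * c_prev + i * g      # new cell state: keep some old, add some new
    h = o * math.tanh(c)        # new hidden state
    return h, c

# Hypothetical unit weights and zero biases, starting from a blank state
ones = {k: 1.0 for k in "ifog"}
zeros = {k: 0.0 for k in "ifog"}
h, c = lstm_step(0.5, 0.0, 0.0, ones, ones, zeros)
print(h, c)
```

The forget gate `f` is what lets the cell state carry information across many tokens of an abstract, which is the property that motivates LSTM for this task.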
If you use this repository or dataset, please cite:
Daffa Fikri. 2025. Klasifikasi Abstrak Artikel Ilmiah pada Dataset Cora Menggunakan Long Short Term Memory. Institut Pertanian Bogor.