This is a sentiment analysis project built with the NLTK and scikit-learn libraries. In this Jupyter notebook, well-known machine learning algorithms are trained on a Twitter dataset and then applied to the National University of Singapore SMS Corpus.
Finally, the percentages of positive and negative messages are compared across 10 different countries.
All required Python packages are listed in the requirements.txt file and can be installed with pip.
Public access to the dataset is provided by the National University of Singapore. The dataset contains 67,093 text messages (SMS) taken from the corpus on Mar 9, 2015, and consists mostly of messages from Singaporeans and students attending the university. You can download it from this.
The data is prepared in the following steps:
- Changing to lowercase and removing punctuation
- Removing empty messages
- Tokenizing the messages
- Removing stopwords
- Creating a bag of words (BoW)
- Vectorizing

For better visualization of the dataset, word clouds of the positive and negative sets are plotted.
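The preparation steps listed above can be sketched roughly as follows. This is a minimal standard-library illustration, not the notebook's actual code: the project uses NLTK, whereas here tokenizing is a plain whitespace split and `STOPWORDS` is a hypothetical, deliberately tiny stand-in for a real stopword list. The function names `preprocess` and `vectorize` are also assumptions made for this example.

```python
import string
from collections import Counter

# Hypothetical, very small stopword list used only for this sketch.
# A real stopword list (for example from NLTK) is much larger.
STOPWORDS = {"a", "an", "and", "is", "the", "to", "this", "i", "it"}


def preprocess(messages):
    """Lowercase, strip punctuation, drop empty messages, tokenize, remove stopwords."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for message in messages:
        text = message.lower().translate(table).strip()
        if not text:
            # Skip messages that are empty once punctuation is removed.
            continue
        tokens = [tok for tok in text.split() if tok not in STOPWORDS]
        cleaned.append(tokens)
    return cleaned


def vectorize(token_lists):
    """Build a bag of words per message and turn each into a count vector."""
    bags = [Counter(tokens) for tokens in token_lists]
    vocabulary = sorted({tok for bag in bags for tok in bag})
    vectors = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, vectors


messages = ["I love this phone!", "", "This is the worst day...", "!!!"]
tokens = preprocess(messages)
vocabulary, vectors = vectorize(tokens)
print(vocabulary)
print(vectors)
```

Each vector has one entry per vocabulary word, holding how often that word occurs in the message. In practice a library vectorizer would likely replace the hand-written `vectorize` step.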
Eight well-known machine learning classifiers are trained on the Twitter dataset, and each classifier's accuracy on the validation set is printed in a table. The models are built with the scikit-learn library.
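A comparison of this kind might look like the sketch below. It is only an illustration under stated assumptions: this README does not name the eight classifiers, so the three used here are arbitrary picks, the eight example texts are made-up toy data rather than the Twitter dataset, and the helper name `compare_classifiers` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(texts, labels, test_size=0.25, seed=0):
    """Train several classifiers on bag-of-words features; return validation accuracy per model."""
    x_train, x_val, y_train, y_val = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    # Fit the vocabulary on the training split only, then reuse it for validation.
    vectorizer = CountVectorizer()
    train_vectors = vectorizer.fit_transform(x_train)
    val_vectors = vectorizer.transform(x_val)

    # Assumed selection of classifiers; the notebook's eight models may differ.
    classifiers = {
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "DecisionTree": DecisionTreeClassifier(random_state=seed),
    }
    results = {}
    for name, clf in classifiers.items():
        clf.fit(train_vectors, y_train)
        results[name] = accuracy_score(y_val, clf.predict(val_vectors))
    return results


# Toy data for illustration only (1 = positive, 0 = negative).
texts = [
    "love this so much", "what a great day", "really happy today", "best phone ever",
    "hate this so much", "what a terrible day", "really sad today", "worst phone ever",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

results = compare_classifiers(texts, labels)
print(f"{'Classifier':<20}{'Accuracy':>10}")
for name, accuracy in results.items():
    print(f"{name:<20}{accuracy:>10.2f}")
```

With so few examples the accuracies carry no meaning; the point is only the shape of the loop: one shared vectorizer, one held-out validation split, and one accuracy score per model.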
The classifiers are applied to the dataset from the National University of Singapore, and the predicted percentages of negative and positive messages are printed for the entire dataset as well as for each country.
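The percentage calculation can be illustrated with a short standard-library sketch. It assumes that predictions are encoded as 1 for positive and 0 for negative and that each message carries a country label; the function name `sentiment_percentages`, the `"ALL"` key, and the sample values are hypothetical.

```python
from collections import defaultdict


def sentiment_percentages(predictions, countries):
    """Return {group: (positive %, negative %)} for the whole dataset and per country."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positive count, total count]
    for label, country in zip(predictions, countries):
        for group in ("ALL", country):
            counts[group][0] += 1 if label == 1 else 0
            counts[group][1] += 1
    percentages = {}
    for group, (positive, total) in counts.items():
        positive_pct = 100.0 * positive / total
        percentages[group] = (positive_pct, 100.0 - positive_pct)
    return percentages


# Made-up predictions and country codes, for illustration only.
predictions = [1, 0, 1, 1]
countries = ["SG", "SG", "IN", "IN"]

for group, (pos, neg) in sentiment_percentages(predictions, countries).items():
    print(f"{group}: {pos:.1f}% positive, {neg:.1f}% negative")
```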
This project was done as the proposed portfolio project for https://www.codecademy.com/learn/paths/natural-language-processing.