This repository provides a tutorial on implementing language classification using the Multinomial Naive Bayes algorithm. The tutorial includes a Python implementation to detect the language of a given text. The code consists of two main files: main.py for user interaction and detector.py, which contains the LanguageClassifier class.
The Multinomial Naive Bayes algorithm is widely used for text classification tasks, including language identification. This tutorial demonstrates how to train a language classifier using a provided dataset and then use the trained model to predict the language of input text.
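To make the algorithm concrete, here is a minimal from-scratch sketch of Multinomial Naive Bayes in plain Python. It is not the repository's code: the function names `train` and `predict`, the whitespace tokenizer, the toy sentences, and the smoothing value `alpha=1.0` are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count documents per class and tokens per class from (text, label) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict(text, model, alpha=1.0):
    """Return the label with the highest log-posterior under additive smoothing."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: share of training documents that carry this label.
        score = math.log(class_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        for token in text.lower().split():
            if token not in vocab:
                continue  # skip tokens never seen during training
            # Log likelihood of the token given the label, smoothed by alpha.
            score += math.log((token_counts[label][token] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data; the real tutorial trains on a CSV dataset.
toy = [
    ("the cat sat on the mat", "English"),
    ("where is the train station", "English"),
    ("le chat est sur le tapis", "French"),
    ("où est la gare", "French"),
]
model = train(toy)
print(predict("the station", model))
print(predict("le chat", model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from driving that class's probability to zero.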
Before running the code, ensure you have the following dependencies installed:
- Python
- Required libraries: requests, bs4, pandas, scikit-learn, joblib
Install the necessary dependencies using the following command:
pip install requests bs4 pandas scikit-learn joblib
- Clone the Repository:
  git clone https://github.com/vivekkdagar/NaiveBayesClassifier.git
  cd NaiveBayesClassifier
- Run the Main Script:
  python3 main.py
- Select Data Source and Input Data:
  - Choose the mode ('raw', 'file', or 'website') to input text data.
- Results:
  - The predicted language for the provided text will be displayed.
- main.py: Handles user interaction and data input.
- detector.py: Contains the LanguageClassifier class, which is responsible for training the model and predicting languages.
The LanguageClassifier class preprocesses the training data by removing special characters and transforming the text into a bag-of-words representation using the CountVectorizer from scikit-learn.
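The preprocessing step can be sketched as follows. The helper name `clean_text` and the regular expression are assumptions rather than the code in detector.py; only `CountVectorizer` comes from the description above.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def clean_text(text):
    """Replace punctuation and other special characters with spaces, then lowercase."""
    # \w is Unicode-aware in Python 3, so accented letters are preserved.
    return re.sub(r"[^\w\s]", " ", text).lower()


docs = ["Hello, world!", "Bonjour le monde!"]
cleaned = [clean_text(d) for d in docs]

# Each column of X is one vocabulary word; each row is one document's word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

The resulting matrix holds raw word counts, which is the kind of input Multinomial Naive Bayes expects.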
The tutorial uses a provided dataset, "Language Detection.csv", to train the Multinomial Naive Bayes model. The model is then serialized using the joblib library for future use.
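A minimal sketch of the train-then-serialize flow is shown below. It uses a small in-memory DataFrame in place of "Language Detection.csv"; the column names `Text` and `Language`, the pipeline layout, and the file name `language_model.joblib` are assumptions and may differ from what detector.py actually does.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the CSV dataset; the column names are assumed.
df = pd.DataFrame({
    "Text": [
        "the cat sat on the mat",
        "where is the train station",
        "le chat est sur le tapis",
        "où est la gare",
    ],
    "Language": ["English", "English", "French", "French"],
})

# Bundling the vectorizer and classifier means one object is saved and reloaded.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(df["Text"], df["Language"])

# Serialize to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "language_model.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

print(restored.predict(["where is the cat"])[0])
```

Saving the vectorizer together with the classifier matters because prediction must reuse the exact vocabulary learned during training.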
- To modify or extend the training dataset, edit the "Language Detection.csv" file.
- Adjust the HTML tag in the scrape_website function within main.py based on your specific use case.
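To illustrate what changing the HTML tag does, the sketch below extracts text from a chosen tag with BeautifulSoup. The helper `extract_text_by_tag` is hypothetical and is not the repository's scrape_website function, which presumably also downloads the page first; here the HTML is an inline string so the example runs without network access.

```python
from bs4 import BeautifulSoup


def extract_text_by_tag(html, tag="p"):
    """Return the text of every <tag> element in the HTML, joined by spaces."""
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(el.get_text(strip=True) for el in soup.find_all(tag))


sample_html = """
<html><body>
  <h1>Titre de la page</h1>
  <p>Bonjour le monde.</p>
  <p>Comment allez-vous ?</p>
</body></html>
"""

# Different tags select different parts of the same page.
print(extract_text_by_tag(sample_html, "p"))
print(extract_text_by_tag(sample_html, "h1"))
```

Choosing a tag that holds the page's main body text (often paragraphs) generally gives the classifier more representative input than headings or navigation elements.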