aryankargwal/wine-connoisseur

wine-connoisseur

In this project I explore the Red Wine Quality dataset and then try to predict wine quality from the data.



With the increase in social consumption, the red wine industry has grown considerably in recent years. Today, industry participants use product quality certification to promote their products. Certification is time-consuming and requires evaluation by human experts, which makes it expensive.
Because human evaluation is subjective, this project aims to classify the quality of wine purely on the basis of its chemical composition.

Dataset

The dataset contains the chemical properties of about 1,600 red wine samples, one per row, with columns such as:

  • Fixed Acidity
  • Volatile Acidity
  • Citric Acid
  • Residual Sugar
  • Chlorides
  • Free Sulfur Dioxide
  • Total Sulfur Dioxide
  • Density
  • pH
  • Sulphates
  • Alcohol
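
As a rough illustration of how such a table can be split into features and a target, here is a minimal sketch using pandas. The lowercase column names, the `quality` target column, the helper name `load_wine`, and the two sample rows are all assumptions made for this example; the values are invented and are not taken from the real dataset.

```python
import io

import pandas as pd

# Two made-up rows in the layout described above (illustrative values only).
# The column names and the "quality" target column are assumptions.
SAMPLE_CSV = """fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.0,0.60,0.10,2.0,0.08,12,40,0.997,3.40,0.60,9.5,5
8.1,0.35,0.40,2.3,0.07,15,45,0.996,3.30,0.75,11.0,7
"""


def load_wine(source):
    """Read the CSV and split it into chemical features X and the quality target y."""
    df = pd.read_csv(source)
    X = df.drop(columns=["quality"])  # the 11 chemical measurements
    y = df["quality"]                 # the score we want to predict
    return X, y


X, y = load_wine(io.StringIO(SAMPLE_CSV))
print(X.shape)
print(list(X.columns))
```

With the real data, `load_wine` would be given the path to the CSV file instead of the in-memory sample.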

Training Methods

Initial Training

The initial training is performed with 3 models, which are then analyzed and optimized further for the best results. The models used are:


  • Random Forest Classifier: A random forest classifier builds a set of decision trees from randomly selected subsets of the training set. It then aggregates the votes of the individual trees to decide the final class of a test sample.

  • Stochastic Gradient Descent Classifier: Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm that finds the parameters/coefficients of a function by minimizing a cost function. Here it is used for discriminative learning of linear classifiers under convex loss functions, such as those of linear SVMs and logistic regression.

  • Support Vector Classifier: A support vector machine (SVM) is a supervised machine learning model for classification, originally formulated for two-group problems. After an SVM model is given labeled training data for each category, it is able to categorize new samples.
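
The three models above can be compared with a few lines of scikit-learn. This is only a sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the wine table, a binary label, an 80/20 split, and feature scaling for the SGD and SVC models. None of these choices are confirmed by this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine data (assumption): 11 features, binary label.
X, y = make_classification(
    n_samples=400, n_features=11, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# SGD and SVC are sensitive to feature scale, so they get a scaler in front.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "sgd": make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
    "svc": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out split
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy on a single held-out split is only a first impression; the scores will differ on the real data.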

Improvements

With the results from the initial training in hand, we use optimization techniques to get the best results we can.


  • Grid Search: Grid search is a process that searches exhaustively through a manually specified subset of the hyperparameter space of the targeted algorithm.

  • Cross Validation: We will be using cross-validation for the Random Forest results.
    Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
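
Both techniques are available in scikit-learn as `GridSearchCV` and `cross_val_score`. The sketch below is illustrative only: the synthetic data, the choice of SVC as the model being tuned, the parameter grid, and the fold counts (k=5 and k=10) are assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine data (assumption).
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=6, random_state=0
)

# Grid search: exhaustively try every combination in a hand-picked grid.
# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)
print("best params:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))

# k-fold cross-validation with k=10: one accuracy score per fold.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
fold_scores = cross_val_score(forest, X, y, cv=10, scoring="accuracy")
print("random forest mean accuracy:", round(fold_scores.mean(), 3))
```

Grid search cost grows with the product of the grid sizes (3 x 3 = 9 candidates here, each fitted once per fold), so the grid is usually kept small.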

Requirements

Set up the Python requirements using requirements.txt:
pip install -r requirements.txt

Credits

A lot of my code was inspired by this notebook.
