import pandas as pd
Lab 05: Wine Quality
For Lab 05, you will use wine data to develop a model that will predict a wine’s quality given its characteristics.
Background
Wine is a popular alcoholic beverage made from fermented fruit. A sommelier is a professional who specializes in wine service, especially wine-food pairings. Sommeliers receive extensive training.
Scenario and Goal
You work for a startup that wants to create an AI Sommelier. Rather than relying on a highly trained human, you will purchase chemistry equipment to generate physicochemical data for wines, and train models based on previous wine quality reviews by human sommeliers.
Your goal is to create a model that predicts a wine’s quality given its physicochemical characteristics.
- Note: We do not state if this is a classification or regression task. In practice, either strategy could be used.
Data
To achieve the goal of this lab, we will need wine quality data. The necessary data is provided in the following files:
- wine-train.csv
- wine-test.csv
Source
The original source of the data is the following paper:
- Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016
However, the data from this paper has become a standard dataset in the machine learning community, and thus is made available via the UC Irvine Machine Learning Repository.
The original data contains two separate datasets, one for red wine and one for white wine. Here, we have combined the data and added a column for the color of the wine. We have made additional modifications to the original data.
Data Dictionary
Each observation in the train, test, and (hidden) production data contains information about a particular Portuguese “Vinho Verde” wine.
Vinho verde is a unique product from the Minho (northwest) region of Portugal. Medium in alcohol, it is particularly appreciated for its freshness (especially in the summer).
Original and complete documentation for this data can be found in the original paper. Additionally, minimal documentation is provided by the UCI MLR.
Response
- quality [int64]: the quality of the wine, based on evaluation by a minimum of three sensory assessors (using blind tastes), who graded the wine on a scale that ranges from 0 (very bad) to 10 (excellent)
Features
- color [object]: the (human perceivable) color of the wine, red or white
- fixed acidity [float64]: grams of tartaric acid per cubic decimeter
- volatile acidity [float64]: grams of acetic acid per cubic decimeter
- citric acid [float64]: grams of citric acid per cubic decimeter
- residual sugar [float64]: grams of residual sugar per cubic decimeter
- chlorides [float64]: grams of sodium chloride per cubic decimeter
- free sulfur dioxide [float64]: milligrams of free sulfur dioxide per cubic decimeter
- total sulfur dioxide [float64]: milligrams of total sulfur dioxide per cubic decimeter
- density [float64]: the total density of the wine in grams per cubic centimeter
- pH [float64]: the acidity of the wine measured using pH
- sulphates [float64]: grams of potassium sulphate per cubic decimeter
- alcohol [float64]: percent alcohol by volume
Data in Python
To load the data in Python, use:
wine_train = pd.read_csv(
    "https://cs307.org/lab-05/data/wine-train.csv",
)
wine_test = pd.read_csv(
    "https://cs307.org/lab-05/data/wine-test.csv",
)
wine_train
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.6 | 0.23 | 0.64 | 12.9 | 0.033 | 54.0 | 170.0 | 0.99800 | 3.00 | 0.53 | 8.8 | 5 | white |
| 1 | NaN | 0.75 | 0.01 | 2.2 | 0.059 | 11.0 | 18.0 | 0.99242 | 3.39 | 0.40 | NaN | 6 | red |
| 2 | 7.4 | 0.67 | 0.12 | 1.6 | 0.186 | 5.0 | 21.0 | 0.99600 | 3.39 | 0.54 | 9.5 | 5 | red |
| 3 | 6.4 | 0.18 | 0.74 | NaN | 0.046 | 54.0 | 168.0 | 0.99780 | 3.58 | 0.68 | 10.1 | 5 | white |
| 4 | 6.7 | 0.35 | 0.32 | 9.0 | 0.032 | 29.0 | 113.0 | 0.99188 | 3.13 | 0.65 | 12.9 | 7 | white |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4152 | 6.6 | 0.32 | 0.22 | 16.7 | 0.046 | 38.0 | 133.0 | 0.99790 | 3.22 | 0.67 | 10.4 | 6 | white |
| 4153 | 9.2 | 0.58 | 0.20 | 3.0 | 0.081 | 15.0 | 115.0 | 0.99800 | 3.23 | 0.59 | 9.5 | 5 | red |
| 4154 | 8.2 | 0.60 | 0.17 | 2.3 | 0.072 | 11.0 | 73.0 | 0.99630 | 3.20 | 0.45 | 9.3 | 5 | red |
| 4155 | 6.5 | 0.23 | 0.36 | 16.3 | 0.038 | 43.0 | 133.0 | 0.99924 | 3.26 | 0.41 | 8.8 | 5 | white |
| 4156 | 6.4 | 0.19 | 0.35 | 10.2 | 0.043 | 40.0 | 106.0 | 0.99632 | 3.16 | 0.50 | 9.7 | 6 | white |
4157 rows × 13 columns
Prepare Data for Machine Learning
Create the X and y variants of the data for use with sklearn:
# create X and y for train
X_train = wine_train.drop("quality", axis=1)
y_train = wine_train["quality"]

# create X and y for test
X_test = wine_test.drop("quality", axis=1)
y_test = wine_test["quality"]
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
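As a rough illustration of the kind of summary statistics you might compute, here is a minimal sketch. It uses a tiny stand-in data frame built from the first five training rows shown above, restricted to four columns, so that it runs without downloading anything. The variable names (sample, missing_counts, color_props, quality_by_color) are hypothetical, and the statistics chosen are only examples; PrairieLearn specifies which ones are actually required.

```python
import numpy as np
import pandas as pd

# Tiny stand-in built from the first five training rows shown above,
# restricted to four columns; in the lab, use the full wine_train instead.
sample = pd.DataFrame(
    {
        "fixed acidity": [7.6, np.nan, 7.4, 6.4, 6.7],
        "alcohol": [8.8, np.nan, 9.5, 10.1, 12.9],
        "quality": [5, 6, 5, 5, 7],
        "color": ["white", "red", "red", "white", "white"],
    }
)

# Number of missing values in each column.
missing_counts = sample.isna().sum()

# Proportion of wines of each color.
color_props = sample["color"].value_counts(normalize=True)

# Mean and standard deviation of quality within each color.
quality_by_color = sample.groupby("color")["quality"].agg(["mean", "std"])

print(missing_counts)
print(color_props)
print(quality_by_color)
```

The same calls should work unchanged on the full training data. Note that the missing values matter later: most sklearn estimators will not accept NaN without an imputation step.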
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with and the same variable names and types as X_train.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a compatible shape with and the same variable names and types as X_train.
  - We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
    - So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model should be created with scikit-learn version 1.5.2 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
  - Your serialized model must be less than 5MB.
To obtain the maximum points via the autograder, your model’s mean absolute error (MAE) must be at or below the following thresholds:
Test MAE: 0.49
Production MAE: 0.49
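The Pipeline and GridSearchCV workflow described in the rules above could look roughly like the following sketch. It is one possible arrangement, not a required or recommended solution. To stay self-contained it uses synthetic stand-in data rather than the real wine data, and the made-up relationship between features and quality, the choice of KNeighborsRegressor, the imputation strategies, and the parameter grid are all assumptions made purely for illustration. The names demo, X_demo_train, and so on are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real wine data) so the sketch runs offline.
rng = np.random.default_rng(307)
n = 300
demo = pd.DataFrame(
    {
        "alcohol": rng.uniform(8.0, 14.0, size=n),
        "volatile acidity": rng.uniform(0.1, 1.0, size=n),
        "color": rng.choice(["red", "white"], size=n),
    }
)
# Made-up relationship between the features and an integer quality score.
signal = 0.5 * demo["alcohol"] - 2.0 * demo["volatile acidity"]
noise = rng.normal(0.0, 0.5, size=n)
demo["quality"] = np.clip(np.round(signal + noise), 0, 10).astype(int)
# Blank out some values to mimic the NaNs visible in the training data.
demo.loc[rng.choice(n, size=20, replace=False), "alcohol"] = np.nan

X_demo = demo.drop("quality", axis=1)
y_demo = demo["quality"]
X_demo_train, X_demo_test = X_demo.iloc[:240], X_demo.iloc[240:]
y_demo_train, y_demo_test = y_demo.iloc[:240], y_demo.iloc[240:]

numeric_features = ["alcohol", "volatile acidity"]
categorical_features = ["color"]

# Numeric columns: impute with the median, then standardize.
numeric_steps = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
# Categorical column: impute with the most frequent level, then one-hot encode.
categorical_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_features),
        ("cat", categorical_steps, categorical_features),
    ]
)
pipeline = Pipeline(
    [("preprocess", preprocessor), ("regressor", KNeighborsRegressor())]
)

# Cross-validated search over an illustrative grid, scored by (negated) MAE.
param_grid = {"regressor__n_neighbors": [5, 15, 25]}
mod = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
mod.fit(X_demo_train, y_demo_train)

# Evaluate on held-out rows the same way the autograder is described as doing.
test_predictions = mod.predict(X_demo_test)
test_mae = mean_absolute_error(y_demo_test, test_predictions)
print(mod.best_params_, round(test_mae, 3))
```

Because all preprocessing lives inside the pipeline, the fitted GridSearchCV object accepts a raw data frame with the original columns, which is what calling mod.predict(X_test) in the autograder requires. The MAE printed here says nothing about performance on the real data.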
Model Persistence
To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.
from joblib import dump

dump(mod, "filename.joblib")
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.
Additional discussion prompts:
- Is the cost of the chemistry equipment and processes worth removing humans from this process? Can this “AI” replace all aspects of a sommelier?
- The quality data given are integers. Are your predictions integers? If not, is that a problem? Are you using classification or regression?
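To make the integer question concrete, the sketch below compares the MAE of continuous predictions with the MAE of the same predictions rounded to the nearest integer. The numbers are made up for illustration and are not output from any real model; y_true and y_pred are hypothetical names.

```python
import numpy as np

# Made-up example values (not real model output): true integer quality
# scores and continuous predictions from a hypothetical regression model.
y_true = np.array([5, 6, 5, 5, 7])
y_pred = np.array([5.4, 5.6, 5.1, 4.8, 6.3])

# Round to the nearest integer and keep the result on the 0 to 10 scale.
y_pred_rounded = np.clip(np.rint(y_pred), 0, 10).astype(int)

# Mean absolute error before and after rounding.
mae_raw = np.mean(np.abs(y_true - y_pred))
mae_rounded = np.mean(np.abs(y_true - y_pred_rounded))

print(y_pred_rounded)
print(mae_raw, mae_rounded)
```

In this made-up example, rounding lowers the MAE from about 0.36 to 0.20, but that is not guaranteed in general: a prediction of 5.4 for a true score of 6 has an error of 0.6 before rounding and 1.0 after. Whether rounding helps is something to check on your own validation results.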
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.