import pandas as pd
Lab 04: Wine Quality
Scenario: You work for a startup that wants to create an AI Sommelier. Rather than employing a highly trained human, you will purchase chemistry equipment to generate physicochemical data for wine and train models on previous wine quality reviews by human sommeliers.
Goal
The goal of this lab is to create a model that predicts the quality of a wine given its physicochemical characteristics.
- Note: We do not state if this is a classification or regression task. In practice, either strategy could be used.
Data
This lab will use data from the UC Irvine Machine Learning Repository.
Response
quality
Features
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
color
Data in Python
To load the data in Python, use:
wine_train = pd.read_csv("https://cs307.org/lab-04/data/wine-train.csv")
To create the `X` and `y` variants of the training data, use:
# create X and y for train
X_train = wine_train.drop("quality", axis=1)
y_train = wine_train["quality"]
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
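As a rough illustration of the kind of summary statistics you might compute, here is a minimal sketch. The `wine_demo` frame and its values are made up for illustration, as are the `"red"` and `"white"` labels for `color`; apply the same calls to `wine_train` and compute whatever PrairieLearn actually requests.

```python
import pandas as pd

# Hypothetical stand-in for wine_train; values and color labels are invented.
wine_demo = pd.DataFrame({
    "alcohol": [9.4, 10.5, 12.8, 11.0],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "color": ["red", "white", "white", "red"],
    "quality": [5, 6, 7, 6],
})

# Count, mean, standard deviation, and quartiles for each numeric column.
summary = wine_demo.describe()

# Mean quality within each wine color.
quality_by_color = wine_demo.groupby("color")["quality"].mean()

# How often each quality score occurs, ordered by score.
quality_counts = wine_demo["quality"].value_counts().sort_index()

print(summary)
print(quality_by_color)
print(quality_counts)
```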
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of `X_train` and `y_train` should not be changed.
  - In the autograder, we will call `mod.predict(X_test)` on your model, where your model is loaded as `mod` and `X_test` has a compatible shape with, and the same variable names and types as, `X_train`.
  - We assume that you will use a `Pipeline` and `GridSearchCV` from `sklearn`, as you will need to deal with heterogeneous data and you should be using cross-validation.
    - More specifically, you should create a `Pipeline` that is fit with `GridSearchCV`. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a `fit` method.
- Your model must have a `predict` method that returns numbers.
- Your model should be created with `scikit-learn` version `1.4.0` or newer.
- Your model should be serialized with `joblib` version `1.3.2` or newer.
  - Your serialized model must be less than 5MB.
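The `Pipeline` plus `GridSearchCV` structure described in the rules above could be sketched roughly as follows. This is one possible arrangement and not a required solution: the synthetic `X_demo` and `y_demo` data, the choice of `KNeighborsRegressor`, the parameter grid, and the `"red"`/`"white"` labels are all assumptions made for illustration. With the real data you would fit on `X_train` and `y_train` and list all of the numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for X_train / y_train (values are random, not real wine data).
rng = np.random.default_rng(0)
n = 60
X_demo = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, n),
    "pH": rng.uniform(2.9, 3.8, n),
    "color": rng.choice(["red", "white"], n),
})
y_demo = pd.Series(rng.integers(3, 9, n), name="quality")

# Heterogeneous data: scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["alcohol", "pH"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("regressor", KNeighborsRegressor()),
])

# Tune a pipeline step with cross-validation, scoring by (negated) MAE.
param_grid = {"regressor__n_neighbors": [3, 5, 11]}
mod = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_absolute_error")
mod.fit(X_demo, y_demo)

# The fitted search object exposes predict, so it can serve as the "model".
preds = mod.predict(X_demo)
print(mod.best_params_)
```

Because the preprocessing lives inside the pipeline, the fitted object accepts a data frame with the same columns and types as the training features, which is what the `mod.predict(X_test)` call in the autograder relies on.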
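To obtain the maximum points via the autograder, your model's test performance must be at least as good as the following threshold. Because MAE is an error metric, that means a test MAE at or below this value: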
Test MAE: 0.47
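For reference, MAE (mean absolute error) is the average of the absolute differences between the true and predicted values, so lower is better. The sketch below computes it for a few made-up numbers; the arrays are purely illustrative, and the autograder's test data are not available to you.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true qualities and predictions, for illustration only.
y_true = np.array([5, 6, 7, 5])
y_pred = np.array([5.5, 6.0, 6.0, 5.5])

# Absolute errors are 0.5, 0.0, 1.0, 0.5, so the mean is 0.5.
mae = mean_absolute_error(y_true, y_pred)

# Compare against the threshold stated in this lab.
meets_threshold = mae <= 0.47
print(mae, meets_threshold)
```

Since the test data are hidden, a cross-validated MAE on the training data is the natural estimate to compare against this threshold.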
Model Persistence
To save your model for submission to the autograder, use the `dump` function from the `joblib` library. Check PrairieLearn for the filename that the autograder expects.
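A minimal sketch of saving a model and checking the file size might look like the following. The `DummyRegressor` stands in for your fitted model, and `model-demo.joblib` is a placeholder filename, not the one the autograder expects.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.dummy import DummyRegressor

# Stand-in for your fitted model; in the lab this would be your fitted GridSearchCV.
mod_demo = DummyRegressor(strategy="mean").fit([[0.0], [1.0]], [5, 7])

# Placeholder filename in a temporary directory; use the name given on PrairieLearn.
path = os.path.join(tempfile.mkdtemp(), "model-demo.joblib")
dump(mod_demo, path)

# Check the serialized size against the 5MB limit (taking 1MB as 10**6 bytes).
size_mb = os.path.getsize(path) / 1e6
print(f"{size_mb:.4f} MB")

# Reload the file to confirm the saved model still predicts.
reloaded = load(path)
print(reloaded.predict([[0.5]]))
```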
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! If you are asked to train multiple models, first make clear which model you selected and are considering for use in practice. Discuss any limitations or potential improvements.
Additional discussion topics:
- Generally comment on the real-world applicability of this model. Is the cost of the chemistry equipment and processes worth removing humans from this process? Can this “AI” replace all aspects of a sommelier?
- The `quality` data given are integers. Are your predictions integers? If not, is that a problem?
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source `.ipynb` file and a rendered `.html` version of the report.