Wine Quality

This page presents information about the Wine Quality dataset, which will be used as part of Lab 04 in CS 307.

Source

The original source of the data is the following paper:

  • Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016

The data from this paper has since become a standard dataset in the machine learning community, and it is also available via the UC Irvine Machine Learning Repository.

The original data contains two separate datasets, one for red wine and one for white wine. Here, we have combined the data and added a column for the color of the wine. We have made additional modifications to the original data, including a train-test split.
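As a rough sketch of the combining step described above (not necessarily the exact preprocessing used for CS 307), two per-color tables can be tagged with a color column and stacked with pandas. The helper name `combine_wines` and the tiny two-column frames are hypothetical and serve only as an illustration.

```python
import pandas as pd


def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack red and white wine tables, adding a `color` column to each."""
    red = red.assign(color="red")
    white = white.assign(color="white")
    # ignore_index=True renumbers the rows 0..n-1 in the combined frame
    return pd.concat([red, white], ignore_index=True)


# Hypothetical two-column stand-ins for the original red and white files
red = pd.DataFrame({"alcohol": [12.4, 11.8], "quality": [7, 6]})
white = pd.DataFrame({"alcohol": [10.5], "quality": [5]})

wine = combine_wines(red, white)
print(wine)
```

Using `assign` returns new frames, so the inputs passed to the helper are left unchanged.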

Data Dictionary

The Wine Quality dataset used in CS 307 has been preprocessed specifically for Lab 04. We document that version of the data here. Each observation contains information about a particular Portuguese “Vinho Verde” wine.

Vinho verde is a unique product from the Minho (northwest) region of Portugal. Medium in alcohol, it is particularly appreciated due to its freshness (especially in the summer).

quality

  • [int64] the quality of the wine, based on evaluation by a minimum of three sensory assessors (using blind tastings), who graded the wine on a scale from 0 (very bad) to 10 (excellent)

color

  • [object] the (human perceivable) color of the wine, red or white

fixed acidity

  • [float64] grams of tartaric acid per cubic decimeter

volatile acidity

  • [float64] grams of acetic acid per cubic decimeter

citric acid

  • [float64] grams of citric acid per cubic decimeter

residual sugar

  • [float64] grams of residual sugar per cubic decimeter

chlorides

  • [float64] grams of sodium chloride per cubic decimeter

free sulfur dioxide

  • [float64] milligrams of free sulfur dioxide per cubic decimeter

total sulfur dioxide

  • [float64] milligrams of total sulfur dioxide per cubic decimeter

density

  • [float64] the total density of the wine in grams per cubic centimeter

pH

  • [float64] the acidity of the wine measured using pH

sulphates

  • [float64] grams of potassium sulphate per cubic decimeter

alcohol

  • [float64] percent alcohol by volume

Original and complete documentation for this data can be found in the original paper. Additionally, minimal documentation is provided by the UCI MLR.
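The data dictionary can also be turned into a quick sanity check on a loaded frame. The following is a minimal sketch, assuming the column names listed above; the helper `check_wine_schema` is hypothetical, and it checks the kind of each column (float, integer, red or white) rather than the exact dtype string, since the dtype reported for text columns can differ between pandas versions.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Physicochemical measurements, all treated as floating point
FLOAT_COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]


def check_wine_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    for col in FLOAT_COLUMNS + ["quality", "color"]:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    for col in FLOAT_COLUMNS:
        if col in df.columns and not is_float_dtype(df[col]):
            problems.append(f"{col} is not a float column")
    if "quality" in df.columns and not is_integer_dtype(df["quality"]):
        problems.append("quality is not an integer column")
    if "color" in df.columns and not df["color"].isin(["red", "white"]).all():
        problems.append("color has values other than red or white")
    return problems


# A single illustrative observation
row = {
    "fixed acidity": 6.6, "volatile acidity": 0.24, "citric acid": 0.35,
    "residual sugar": 7.7, "chlorides": 0.031, "free sulfur dioxide": 36.0,
    "total sulfur dioxide": 135.0, "density": 0.9938, "pH": 3.19,
    "sulphates": 0.37, "alcohol": 10.5, "quality": 5, "color": "white",
}
example = pd.DataFrame([row])
print(check_wine_schema(example))
```

Returning a list of messages, rather than raising on the first problem, makes it easier to see every mismatch at once.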

Data for Machine Learning

For the CS 307 lab, we provide training data, stored as a CSV file and accessible via the web.

The test data is only available within the autograder.
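Because the test data cannot be inspected directly, one common option is to hold out part of the training data to estimate performance locally. The sketch below is one way to do that with pandas alone; the helper `holdout_split`, the 20% fraction, the seed, and the small stand-in frame are all assumptions made for illustration, not part of the lab's requirements.

```python
import pandas as pd


def holdout_split(df: pd.DataFrame, frac: float = 0.2, seed: int = 42):
    """Randomly split df into (train, validation) with no shared rows.

    Assumes df has a unique index, as a freshly loaded CSV does.
    """
    validation = df.sample(frac=frac, random_state=seed)
    train = df.drop(index=validation.index)
    return train, validation


# Hypothetical stand-in for the training data
toy = pd.DataFrame({"alcohol": [10.5, 12.4, 11.8, 11.7, 9.3],
                    "quality": [5, 7, 6, 6, 5]})
train, validation = holdout_split(toy, frac=0.2)
print(len(train), len(validation))
```

Fixing the seed keeps the split reproducible from one run to the next.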

Loading the Data

import pandas as pd
wine_train = pd.read_csv("https://cs307.org/lab-04/data/wine-train.csv")
wine_train
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality  color
0               6.6             0.240         0.35            7.70      0.031                 36.0                 135.0  0.99380  3.19       0.37     10.5        5  white
1               8.3             0.280         0.48            2.10      0.093                  6.0                  12.0  0.99408  3.26       0.62     12.4        7    red
2               7.7             0.715         0.01            2.10      0.064                 31.0                  43.0  0.99371  3.41       0.57     11.8        6    red
3               5.2             0.370         0.33            1.20      0.028                 13.0                  81.0  0.99020  3.37       0.38     11.7        6  white
4               6.6             0.260         0.56           15.40      0.053                 32.0                 141.0  0.99810  3.11       0.49      9.3        5  white
...             ...               ...          ...             ...        ...                  ...                   ...      ...   ...        ...      ...      ...    ...
5192            7.6             0.320         0.58           16.75      0.050                 43.0                 163.0  0.99990  3.15       0.54      9.2        5  white
5193            5.6             0.280         0.27            3.90      0.043                 52.0                 158.0  0.99202  3.35       0.44     10.7        7  white
5194            6.4             0.370         0.20            5.60      0.117                 61.0                 183.0  0.99459  3.24        NaN      9.5        5  white
5195            6.5             0.260         0.50             NaN      0.051                 46.0                 197.0  0.99536  3.18       0.47      9.5        5  white
5196            7.2             0.620         0.06            2.70      0.077                 15.0                  85.0  0.99746  3.51       0.54      9.5        5    red

5197 rows × 13 columns
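The output above contains NaN entries (for example in residual sugar and sulphates), so the data has missing values that need handling before most models can be fit. The following is a minimal sketch of counting and filling them, using only the last three rows shown above and three of their columns; median filling is one simple choice assumed here for illustration, not a prescribed method for the lab.

```python
import numpy as np
import pandas as pd

# The last three rows of the output above, restricted to three columns
tail = pd.DataFrame(
    {
        "residual sugar": [5.60, np.nan, 2.70],
        "sulphates": [np.nan, 0.47, 0.54],
        "quality": [5, 5, 5],
    },
    index=[5194, 5195, 5196],
)

# Count missing values per column
missing = tail.isna().sum()
print(missing)

# One simple option: fill each gap with that column's median
filled = tail.fillna(tail.median(numeric_only=True))
print(filled)
```

On the full training data, `wine_train.isna().sum()` gives the same per-column count. If scikit-learn is used for modeling, `SimpleImputer(strategy="median")` offers comparable behavior inside a pipeline.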