Lab 00: Palmer Penguins

Scenario: You are a researcher who has previously collected physical measurements and species information about penguins at Palmer Station, Antarctica. You worry that future researchers may not be proficient at identifying penguin species visually. Instead, you hope to develop a model that can accurately identify penguin species based on the depth and length of their bills, measurements that researchers can easily be trained to collect.

This setup is 100% fabricated by Dave and may not relate in any way to a practical real-world situation. Is it hard to identify penguin species visually? Is it easy to measure a penguin’s bill length and depth? Who knows?[1] However, in CS 307, we will always stress the evaluation of models in the context of a real-world scenario! So, for the purpose of this lab, please buy into the premise. Some labs will have a more realistic setup than others[2], but we will always work within some pre-stated scenario.

Goal

The goal of this lab is to create a classifier for the species of a penguin based on its measurable physical characteristics, specifically its bill depth and length.

Data

This lab will use the Palmer Penguins data. In particular, we will focus on the bill_length_mm and bill_depth_mm variables as features, used to predict the response variable species.

Response

  • species

Features

  • bill_length_mm
  • bill_depth_mm

Data in Python

To load the data in Python, use:

import pandas as pd
penguins_train = pd.read_csv("https://cs307.org/lab-00/data/penguins-train.csv")
penguins_test = pd.read_csv("https://cs307.org/lab-00/data/penguins-test.csv")

To create the X and y data for training and testing, use:

X_train = penguins_train[["bill_length_mm", "bill_depth_mm"]]
y_train = penguins_train["species"]
X_test = penguins_test[["bill_length_mm", "bill_depth_mm"]]
y_test = penguins_test["species"]

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
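As a rough sketch of what "looking at the data" might involve, the snippet below computes species proportions and per-species feature summaries with pandas. The rows are made-up stand-ins (not real Palmer measurements) so that the example runs without a download, and the particular statistics shown are assumptions; the ones actually requested are listed on PrairieLearn.

```python
import pandas as pd

# Made-up stand-in rows (NOT real Palmer measurements), using the lab's column names.
penguins_train = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.0, 38.0, 40.0, 47.0, 49.0, 50.0],
    "bill_depth_mm": [18.0, 19.0, 17.0, 14.0, 15.0, 19.0],
})

# Proportion of each species in the training data.
species_props = penguins_train["species"].value_counts(normalize=True)

# Mean and standard deviation of each feature, computed per species.
feature_stats = (
    penguins_train
    .groupby("species")[["bill_length_mm", "bill_depth_mm"]]
    .agg(["mean", "std"])
)

print(species_props)
print(feature_stats)
```

With the real data, the same calls apply to the penguins_train frame loaded earlier. For the visualization, a scatter plot of bill_length_mm against bill_depth_mm colored by species is one reasonable option, though other plots may serve the report equally well.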

Models

For this lab you will fit two models.

  1. DummyClassifier
  2. DecisionTreeClassifier

For both, use all default parameters. Fit each model directly to the provided training data (X_train and y_train), as this is what the autograder will expect.
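A minimal sketch of fitting both models with default parameters is shown below. It assumes scikit-learn is installed and uses a few made-up rows (not real Palmer measurements) in place of the lab's X_train and y_train; the names dummy, dt, and X_new are illustrative only.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in training data (NOT real Palmer measurements).
X_train = pd.DataFrame({
    "bill_length_mm": [39.0, 38.0, 40.0, 47.0, 49.0, 50.0],
    "bill_depth_mm": [18.0, 19.0, 17.0, 14.0, 15.0, 19.0],
})
y_train = pd.Series(
    ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    name="species",
)

# Baseline: with default parameters, always predicts the most frequent training class.
dummy = DummyClassifier()
dummy.fit(X_train, y_train)

# Decision tree with default parameters.
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Hypothetical new penguins to classify.
X_new = pd.DataFrame({
    "bill_length_mm": [39.5, 48.0],
    "bill_depth_mm": [18.5, 14.5],
})
dummy_pred = dummy.predict(X_new)
dt_pred = dt.predict(X_new)

print(dummy_pred)
print(dt_pred)
```

Because the dummy model ignores the features, it gives a floor that any useful classifier should beat.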

For this lab, you are being told the exact models to fit. In future labs, you will more likely be given metric values that your models must achieve for full credit. For this lab, those values are:

print(f"DummyClassifier: {results['dummy_acc']}")
print(f"DecisionTreeClassifier: {results['dt_acc']}")  # 'dt_acc' is an assumed key name for the tree's accuracy
DummyClassifier: 0.48
DecisionTreeClassifier: 0.48

If you fit the models as described above, these metrics will be achieved automatically.
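To check a fitted model against a metric value, one option is scikit-learn's accuracy_score. The sketch below is an assumption about how a results dictionary like the one printed above could be built: the key 'dummy_acc' appears in the lab, while 'dt_acc' and the toy data (not real Palmer measurements) are hypothetical.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in data (NOT real Palmer measurements): [bill_length_mm, bill_depth_mm].
X_train = [[39.0, 18.0], [38.0, 19.0], [40.0, 17.0],
           [47.0, 14.0], [49.0, 15.0], [50.0, 19.0]]
y_train = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"]
X_test = [[39.5, 18.5], [48.0, 14.5], [51.0, 19.5], [37.5, 18.0]]
y_test = ["Adelie", "Gentoo", "Chinstrap", "Adelie"]

# 'dt_acc' is a hypothetical key name chosen for this sketch.
models = {
    "dummy_acc": DummyClassifier(),
    "dt_acc": DecisionTreeClassifier(),
}

results = {}
for key, model in models.items():
    model.fit(X_train, y_train)
    # Accuracy: fraction of test penguins whose species is predicted correctly.
    results[key] = accuracy_score(y_test, model.predict(X_test))

print(f"DummyClassifier: {results['dummy_acc']}")
print(f"DecisionTreeClassifier: {results['dt_acc']}")
```

With the real data, X_train, y_train, X_test, and y_test would be the objects created in the Data in Python section.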

Model Persistence

To save your models for submission to the autograder, use the dump function from the joblib library. This process is called serialization. If you have not used this library before, you will need to install it.

from joblib import dump
dump(model_object, "filename.joblib")

Use the following filenames when saving your models:

  • DummyClassifier: penguins-dummy.joblib
  • DecisionTreeClassifier: penguins-dt.joblib

The autograder will only accept these filenames.
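Before submitting, it can be worth confirming that a saved model loads back correctly. The sketch below round-trips a model through joblib's dump and load; the temporary directory and the toy data (not real Palmer measurements) are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.dummy import DummyClassifier

# Made-up stand-in data (NOT real Palmer measurements): [bill_length_mm, bill_depth_mm].
X_train = [[39.0, 18.0], [47.0, 14.0], [38.0, 19.0]]
y_train = ["Adelie", "Gentoo", "Adelie"]

model = DummyClassifier().fit(X_train, y_train)

# Serialize to disk, then deserialize and confirm the restored model still predicts.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "penguins-dummy.joblib")
    dump(model, path)
    restored = load(path)
    restored_pred = restored.predict([[45.0, 16.0]])

print(restored_pred)
```

For the actual submission, the file would be saved somewhere persistent (for example, the notebook's working directory) rather than in a temporary directory, using the required filenames.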

Discussion

As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! If you are asked to train multiple models, first make clear which model you selected and are considering for use in practice. Discuss any limitations or potential improvements.

In future labs, you may be given additional discussion prompts that should be answered in the discussion section.

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page!

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.

Footnotes

  1. Probably penguin researchers.

  2. Given Dave’s interest in baseball analytics, those scenarios will be as real-world as humanly possible.