Palmer Penguins

This page presents information about the Palmer Penguins dataset which will be used as a part of Lab 00 in CS 307.

A duck walks into a drugstore to purchase some lip balm.

While checking out, the cashier asks: “Will that be cash or credit?”

The duck responds: “Put it on my bill.”

– As told by Andrew “Andy” Glaysher, BS Media Studies, 2009

Source

The Palmer Penguins data were originally collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

  • Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

The data as used here is derived from the palmerpenguins Python package, which is a port of the palmerpenguins R package. The load_penguins() function from either package returns the data as originally published via these packages. However, that is not the data you should use for lab in CS 307.

The Penguins

Meet the Palmer Penguins: Chinstrap, Gentoo[1], and Adelie!

Cartoon images of the three species of penguins in the Palmer Penguins dataset: Chinstrap, Gentoo, and Adelie!

In lab, we will focus on the length and depth of the penguins’ bills. We’ll attempt to use these measurements to predict the species of each penguin.

A cartoon explaining how bill depth and length are measured.


Artwork by Allison Horst.

Data Dictionary

The Palmer Penguins dataset includes size measurements for adult foraging penguins near Palmer Station, Antarctica. The data includes measurements for penguin species, island in Palmer Archipelago, size (flipper length, body mass, bill dimensions), and sex.

species

  • a string denoting penguin species (Adélie, Chinstrap, or Gentoo; Adélie is stored as Adelie, without the accent)

island

  • a string denoting island in Palmer Archipelago, Antarctica (Biscoe, Dream or Torgersen)

bill_length_mm

  • a number denoting bill length (millimeters)

bill_depth_mm

  • a number denoting bill depth (millimeters)

flipper_length_mm

  • a number denoting flipper length (millimeters)

body_mass_g

  • a number denoting body mass (grams)

sex

  • a string denoting penguin sex (female, male)

year

  • an integer denoting the study year (2007, 2008, or 2009)

This dictionary is adapted from the original reference page from the R package, with changes appropriate for the data as presented in Python.
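As a rough illustration, the dictionary above can be turned into a few sanity checks on a loaded data frame. The sketch below uses five rows copied from the training data preview further down this page, so it runs without network access. The names `expected_columns`, `allowed_values`, `numeric_columns`, and `problems` are made up for this example and are not part of any lab code.

```python
import io

import pandas as pd

# Five rows copied from the training data preview on this page, so the
# sketch runs offline. In practice, check the data frame from pd.read_csv.
sample_csv = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Biscoe,40.5,17.9,187.0,3200.0,female,2007
Chinstrap,Dream,49.2,18.2,195.0,4400.0,male,2007
Chinstrap,Dream,52.8,20.0,205.0,4550.0,male,2008
Adelie,Biscoe,37.6,17.0,185.0,3600.0,female,2008
Gentoo,Biscoe,47.3,15.3,222.0,5250.0,male,2007
"""
penguins = pd.read_csv(io.StringIO(sample_csv))

# Column order as listed in the data dictionary.
expected_columns = [
    "species", "island", "bill_length_mm", "bill_depth_mm",
    "flipper_length_mm", "body_mass_g", "sex", "year",
]

# Allowed values taken from the data dictionary.
# Note: the data stores "Adelie" without the accent.
allowed_values = {
    "species": {"Adelie", "Chinstrap", "Gentoo"},
    "island": {"Biscoe", "Dream", "Torgersen"},
    "sex": {"female", "male"},
    "year": {2007, 2008, 2009},
}

numeric_columns = [
    "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g",
]

# Collect human-readable descriptions of anything that disagrees
# with the dictionary.
problems = []
if list(penguins.columns) != expected_columns:
    problems.append("columns do not match the data dictionary")
for column, allowed in allowed_values.items():
    # Missing values are ignored here; only observed values are checked.
    unexpected = set(penguins[column].dropna().unique()) - allowed
    if unexpected:
        problems.append(f"{column}: unexpected values {sorted(unexpected)}")
for column in numeric_columns:
    if not pd.api.types.is_numeric_dtype(penguins[column]):
        problems.append(f"{column}: expected a numeric column")

print(problems if problems else "No problems found.")
```

The same checks could be run on any data frame with these eight columns. Missing values are deliberately skipped rather than flagged.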

Data for Machine Learning

For CS 307 lab, we provide pre-split train and test datasets, stored as CSV files and accessible via the web.

Loading the Data

import pandas as pd
penguins_train = pd.read_csv("https://cs307.org/lab-00/data/penguins-train.csv")
penguins_test = pd.read_csv("https://cs307.org/lab-00/data/penguins-test.csv")
penguins_train
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0       Adelie     Biscoe            40.5           17.9              187.0       3200.0  female  2007
1    Chinstrap      Dream            49.2           18.2              195.0       4400.0    male  2007
2    Chinstrap      Dream            52.8           20.0              205.0       4550.0    male  2008
3       Adelie     Biscoe            37.6           17.0              185.0       3600.0  female  2008
4       Gentoo     Biscoe            47.3           15.3              222.0       5250.0    male  2007
...        ...        ...             ...            ...                ...          ...     ...   ...
228     Gentoo     Biscoe            49.6           15.0              216.0       4750.0    male  2008
229     Adelie  Torgersen            37.2           19.4              184.0       3900.0    male  2008
230     Adelie     Biscoe            39.7           17.7              193.0       3200.0  female  2009
231  Chinstrap      Dream            45.2           17.8              198.0       3950.0  female  2007
232     Adelie     Biscoe            38.1           17.0              181.0       3175.0  female  2009

233 rows × 8 columns

penguins_test
       species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0       Adelie      Dream            39.5           16.7              178.0       3250.0  female  2007
1    Chinstrap      Dream            50.9           17.9              196.0       3675.0  female  2009
2       Adelie  Torgersen            42.1           19.1              195.0       4000.0    male  2008
3       Gentoo     Biscoe            46.6           14.2              210.0       4850.0  female  2008
4       Adelie     Biscoe            41.1           18.2              192.0       4050.0    male  2008
...        ...        ...             ...            ...                ...          ...     ...   ...
95      Adelie     Biscoe            37.8           18.3              174.0       3400.0  female  2007
96      Adelie  Torgersen            39.2           19.6              195.0       4675.0    male  2007
97      Gentoo     Biscoe            45.8           14.2              219.0       4700.0  female  2008
98      Adelie      Dream            43.2           18.5              192.0       4100.0    male  2008
99      Adelie      Dream            39.2           21.1              196.0       4150.0    male  2007

100 rows × 8 columns
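Since the lab will use bill length and depth to predict species, one possible starting point is to separate those two features from the target. The sketch below does this with a handful of rows copied from the previews above, so it runs offline, and then applies a hand-written nearest-centroid rule. That rule is only an illustrative assumption, not necessarily the method used in lab, and `predict`, `centroids`, and the other names are hypothetical.

```python
import io

import pandas as pd

# Rows copied from the train and test previews above (only the columns
# needed here). In lab, use the full penguins_train and penguins_test.
train_csv = """species,bill_length_mm,bill_depth_mm
Adelie,40.5,17.9
Chinstrap,49.2,18.2
Chinstrap,52.8,20.0
Adelie,37.6,17.0
Gentoo,47.3,15.3
Gentoo,49.6,15.0
Adelie,37.2,19.4
Adelie,39.7,17.7
Chinstrap,45.2,17.8
Adelie,38.1,17.0
"""
test_csv = """species,bill_length_mm,bill_depth_mm
Adelie,39.5,16.7
Chinstrap,50.9,17.9
Adelie,42.1,19.1
Gentoo,46.6,14.2
Adelie,41.1,18.2
Adelie,37.8,18.3
Adelie,39.2,19.6
Gentoo,45.8,14.2
Adelie,43.2,18.5
Adelie,39.2,21.1
"""
train = pd.read_csv(io.StringIO(train_csv))
test = pd.read_csv(io.StringIO(test_csv))

features = ["bill_length_mm", "bill_depth_mm"]
target = "species"

# Separate the feature columns (X) from the target column (y).
X_train, y_train = train[features], train[target]
X_test, y_test = test[features], test[target]

# Nearest-centroid rule: average the features within each species,
# then assign each new penguin to the species with the closest average.
centroids = X_train.groupby(y_train).mean()


def predict(X):
    """Return the species whose centroid is closest to each row of X."""
    # Squared Euclidean distance from every row to every centroid.
    distances = pd.DataFrame(
        {
            species: ((X - center) ** 2).sum(axis=1)
            for species, center in centroids.iterrows()
        }
    )
    return distances.idxmin(axis=1)


predictions = predict(X_test)
accuracy = (predictions == y_test).mean()
print(f"Accuracy on the sample test rows: {accuracy:.2f}")
```

With the real files, the same feature and target selection would be applied to penguins_train and penguins_test as loaded above. Accuracy on ten hand-picked rows says little about performance on the full test data.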

History

As you move through your journey as a data scientist, you’ll likely encounter the Palmer Penguins dataset with some frequency. An interesting question is, why?

In many ways, the Palmer Penguins data is meant to replace the even more ubiquitous Iris flower data. This again leads to the question, why? It is arguably a better dataset because it is useful for everything that the Iris data was, and more.

The Palmer Penguins data…

  • uses a permissive CC0 license.
  • uses understandable variables.
  • has complete metadata and documentation.
  • is a manageable size, but less trivial than the Iris data.
  • requires little to no processing for many analyses.
  • is real-world data, not simulated.
  • includes additional feature variables, including categorical variables, as well as missing data, providing more opportunity for usage and learning.

This reasoning is expanded upon in the R Journal article introducing the dataset.

All that is wonderful, but there was another compelling reason to move away from the Iris data: racism. How can flowers be racist? Well, of course flowers themselves cannot be racist, but the origin of the data has to be considered.

If you study statistics for any length of time, you’ll start to notice that many methods and theories can be traced to one person, Ronald Aylmer (R.A.) Fisher.[2]

Like many of his contemporaries, Fisher’s interest in statistics and biology cannot be separated from his interest in and study of eugenics. While this is a very unfortunate part of the history of statistics, it cannot simply be ignored. Many of the most foundational statistical methods were developed by researchers and scientists who were at least interested in, if not actively pursuing, eugenics. If you were not already of the mind that statistical methodology should be applied with great care and ethical consideration, hopefully this history will change your perspective.

But how does this all relate to flowers? Well, the Iris data often goes by another name: Fisher’s Iris data set.[3] If we stopped using all of Fisher’s creations in statistics, we might not be left with much. But this particular artifact was also published in the Annals of Eugenics.

For additional reading and background:

Footnotes

  1. Linux users might recognize the Gentoo name.

  2. Interestingly, among his many faults, Fisher did not believe that smoking caused cancer. Especially at the time, his argument was not completely without merit, but his thoughts on the topic were used to counter the Surgeon General’s report in the 1960s, which claimed that smoking caused cancer.

  3. The data was actually gathered by Edgar Anderson, so you will also see it referred to as Anderson’s Iris data set.