Lab 01: Urbana Weather

For Lab 01, you will use weather data to develop a model that will predict the minimum temperature in Urbana, IL for a particular day of the year.

Background

Lincoln Square is home to both the Common Ground Food Co-Operative and the Market at the Square. The Market at the Square is the Urbana farmers market.¹

Because the weather in Urbana is not so nice year-round, the market is only held outdoors during reasonable weather. As winter approaches, some of the vendors continue in a more limited capacity, indoors at Lincoln Square.

Scenario and Goal

You are the manager of the Market at the Square, the local Urbana farmers market. Each year, sometime in autumn, the market moves from outdoors to indoors. You would like to reliably predict when to make the move, well in advance, so that vendors have certainty about when the change will take place, since not all vendors make the switch to indoors. You hope to find a model for the minimum daily temperature (the market opens early in the morning, and vendors arrive even earlier) so that you can predict when it will be too cold to hold the market outdoors.

Data

To achieve the goal of this lab, we will need historical weather data. The necessary data is provided in the following files:

  • weather-train.csv (train)
  • weather-vtrain.csv (validation-train)
  • weather-validation.csv (validation)
  • weather-test.csv (test)

Source

The Urbana Weather data was collected using the Open-Meteo API. Specifically, the Historical Weather API was used.

  • Zippenfenig, P. (2023). Open-Meteo.com Weather API [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.7970649

The Historical Weather API is based on reanalysis datasets and uses a combination of weather station, aircraft, buoy, radar, and satellite observations to create a comprehensive record of past weather conditions. These datasets are able to fill in gaps by using mathematical models to estimate the values of various weather variables. As a result, reanalysis datasets are able to provide detailed historical weather information for locations that may not have had weather stations nearby, such as rural areas or the open ocean.

Additional citations specific to the weather models used by the API can be found on the Open-Meteo website.

The Urbana Weather data was accessed using:

  • Latitude: 40.1106
  • Longitude: -88.2073

On a map, this places the location almost exactly at Lincoln Square.

Open-Meteo provides excellent documentation on their APIs.

Their documentation provides detailed information about how to use the API for the Urbana location. It will even automatically generate Python code to make a request to the API and collect the results as a pandas data frame!
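For reference, a request to the Historical Weather API can also be made directly with the requests library. The sketch below is only an illustration: the endpoint and parameter names are assumptions based on the Open-Meteo documentation, and the documentation's generated code (which uses the openmeteo_requests client) should be treated as the authoritative version.

import pandas as pd
import requests

# assumed endpoint and parameters for the Historical Weather API; verify
# these against the Open-Meteo documentation before relying on them
params = {
    "latitude": 40.1106,
    "longitude": -88.2073,
    "start_date": "2016-01-01",
    "end_date": "2021-12-31",
    "daily": "temperature_2m_min",
    "timezone": "America/Chicago",
}
response = requests.get("https://archive-api.open-meteo.com/v1/archive", params=params)
daily = response.json()["daily"]

# collect the daily results into a pandas data frame indexed by date
weather = pd.DataFrame(
    {"temperature_2m_min": daily["temperature_2m_min"]},
    index=pd.to_datetime(daily["time"]),
)

You do not need to call the API for this lab; the pre-split CSV files provided above contain everything required.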

Data Dictionary

The data here is not split randomly. The different datasets are split according to time.

  • Train: 2016 - 2021
  • Validation-Train: 2016 - 2019
  • Validation: 2020 - 2021
  • Test: 2022
  • Production: 2023

This data is pre-split as we have not yet fully discussed data splitting. Also note that the validation-train plus validation data together make up the full train data.

The index of each data frame is the full date using the ISO 8601 standard. For example, 2020-07-04.

Response

temperature_2m_min

  • [float64] the minimum air temperature at 2 meters above ground for the day

Features

year

  • [int64] year, such as 2020

month

  • [int64] month, such as 10 for October

day

  • [int64] day of the month, for example 20 for January 20

day_of_year

  • [int64] day of the year, for example 100, which in non-leap years is April 10

For this lab, we require the use of only the year and day_of_year features.

Data in Python

To load the data in Python, use:

import pandas as pd
weather_train = pd.read_csv(
    "https://cs307.org/lab-01/data/weather-train.csv",
    index_col="date",
    parse_dates=True
)
weather_vtrain = pd.read_csv(
    "https://cs307.org/lab-01/data/weather-vtrain.csv",
    index_col="date",
    parse_dates=True
)
weather_validation = pd.read_csv(
    "https://cs307.org/lab-01/data/weather-validation.csv",
    index_col="date",
    parse_dates=True
)
weather_test = pd.read_csv(
    "https://cs307.org/lab-01/data/weather-test.csv",
    index_col="date",
    parse_dates=True
)
weather_train
            temperature_2m_min  year  month  day  day_of_year
date
2016-01-01             -4.2715  2016      1    1            1
2016-01-02             -3.8715  2016      1    2            2
2016-01-03             -4.4715  2016      1    3            3
2016-01-04             -3.0215  2016      1    4            4
2016-01-05             -5.7715  2016      1    5            5
...                        ...   ...    ...  ...          ...
2021-12-27              6.9980  2021     12   27          361
2021-12-28              1.7980  2021     12   28          362
2021-12-29              2.1980  2021     12   29          363
2021-12-30              0.5980  2021     12   30          364
2021-12-31              4.7480  2021     12   31          365

2192 rows × 5 columns
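Because the splits are defined by time, a quick sanity check is to confirm the years covered by each data frame. A minimal check, assuming the data frames were loaded as shown above:

# confirm the time-based split: each frame should cover the expected years
print(weather_train.index.year.unique())       # expected: 2016 - 2021
print(weather_vtrain.index.year.unique())      # expected: 2016 - 2019
print(weather_validation.index.year.unique())  # expected: 2020 - 2021
print(weather_test.index.year.unique())        # expected: 2022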

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = weather_train[["year", "day_of_year"]]
y_train = weather_train["temperature_2m_min"]

# create X and y for validation-train
X_vtrain = weather_vtrain[["year", "day_of_year"]]
y_vtrain = weather_vtrain["temperature_2m_min"]

# create X and y for validation
X_validation = weather_validation[["year", "day_of_year"]]
y_validation = weather_validation["temperature_2m_min"]

# create X and y for test
X_test = weather_test[["year", "day_of_year"]]
y_test = weather_test["temperature_2m_min"]

You can assume that within the autograder, similar processing is performed on the production data.

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
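The exact statistics to report are specified on PrairieLearn, so the snippet below is only one possible starting point; it assumes matplotlib for the visualization, but any plotting library is fine.

import matplotlib.pyplot as plt

# numeric summary of the response on the training data
print(y_train.describe())

# minimum daily temperature versus day of the year, colored by year
plt.scatter(X_train["day_of_year"], y_train, c=X_train["year"], s=10)
plt.xlabel("Day of Year")
plt.ylabel("temperature_2m_min")
plt.colorbar(label="Year")
plt.show()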

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with, and the same variable names and types as, X_train.
    • In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a shape compatible with, and the same variable names and types as, X_train.
  • Your model must have a fit method.
  • Your model must have a predict method.
  • Your model should be created with scikit-learn version 1.5.1 or newer.
  • Your model should be serialized with joblib version 1.4.2 or newer.
    • Your serialized model must be less than 5MB.

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.
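As a concrete illustration (not a requirement), the sketch below fits a k-nearest neighbors regressor, which satisfies all of the rules above: it starts from the unmodified training data and provides both fit and predict methods. The value of n_neighbors is arbitrary here and should be chosen using the validation data.

from sklearn.neighbors import KNeighborsRegressor

# an illustrative model that follows the rules above
mod = KNeighborsRegressor(n_neighbors=25)
mod.fit(X_train, y_train)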

To obtain the maximum points via the autograder, your model performance must meet or exceed:

Test RMSE: 5.5
Production RMSE: 5.5
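One way to judge whether a candidate model is likely to clear these thresholds is to estimate its RMSE on the validation data before submitting. A sketch, again using the illustrative KNN model, this time fit to the validation-train data:

from sklearn.metrics import root_mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# fit a candidate model on validation-train, then estimate RMSE on validation
candidate = KNeighborsRegressor(n_neighbors=25)
candidate.fit(X_vtrain, y_vtrain)
validation_rmse = root_mean_squared_error(y_validation, candidate.predict(X_validation))
print(validation_rmse)

Keep in mind that validation RMSE only estimates how the model will perform on the test and production data, so leave yourself some margin relative to the 5.5 target.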

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.

from joblib import dump
dump(mod, "filename.joblib")
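After saving, it is worth confirming that the serialized file is under the 5MB limit and that it loads back and still predicts. A quick check, assuming the placeholder filename shown above:

import os
from joblib import load

# approximate size of the serialized model in megabytes
print(os.path.getsize("filename.joblib") / 1e6, "MB")

# confirm the model survives a round trip through serialization
mod_loaded = load("filename.joblib")
mod_loaded.predict(X_test)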

Discussion

As always, be sure to state a conclusion: whether or not you would use the model you trained and selected for the real-world scenario described at the start of the lab! Justify your conclusion. If your report mentions multiple models, first make clear which one you selected and are considering for use in practice.

Additional discussion prompts:

  • Does the overall strategy here seem appropriate? Do you have any general weather knowledge that suggests an obvious flaw here?
    • Be sure you have read the data background, paying attention to how the data was collected and split.
  • Assuming you used KNN, does distance make sense here? What is the distance between two dates in time for this data, and is that a meaningful notion of distance?

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.

Footnotes

  1. If you’ve never been, we highly recommend it!