Lab 02: Weather

For Lab 02, you will use weather data to develop a model that will predict the minimum temperature in Urbana, IL for a particular day of the year.

Background

Lincoln Square is home to both the Common Ground Food Co-Operative and the Market at the Square. Market at the Square is the Urbana farmers market. [1]

Because the weather in Urbana is not so nice year-round, the market is only held outdoors during reasonable weather. As winter approaches, some of the vendors continue in a more limited capacity, but indoors at Lincoln Square.

Scenario and Goal

Who are you?

  • You work as staff for the City of Urbana. The current date is January 1, 2024.

What is your task?

  • The Office of the Mayor has asked you to provide an estimate of when, towards the end of 2025, the market should move indoors as winter arrives.

Who are you writing for?

  • To summarize your work, you will write a report for the Office of the Mayor, to be shared with the full City Council. Because these individuals are not familiar with machine learning, you will need to make your recommendation clear without using too much ML jargon. The Office of the Mayor and the City Council are, of course, familiar with the Market at the Square and with the freezing temperatures that arrive in Urbana during late fall and early winter at the end of the calendar year.
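One way to turn daily predictions into a date recommendation can be sketched as follows. This is only an illustration under assumptions, not the required approach: the function name `first_day_below`, the `start_day` default, and the toy predictions are all hypothetical, and `threshold` should be whatever value counts as freezing in the data's temperature unit.

```python
def first_day_below(predicted_min, threshold, start_day=182):
    """Return the first day of year >= start_day predicted below threshold.

    predicted_min maps day_of_year (1-based) to a predicted minimum
    temperature. start_day skips the cold days at the start of the year;
    182 (roughly July 1) is an arbitrary assumption. Returns None if no
    day qualifies.
    """
    for day in sorted(predicted_min):
        if day >= start_day and predicted_min[day] < threshold:
            return day
    return None


# Made-up predictions for four consecutive autumn days.
toy_predictions = {300: 5.0, 301: 2.0, 302: -1.0, 303: 1.0}
print(first_day_below(toy_predictions, threshold=0.0))
```

A single day dipping below the threshold may not justify moving the market, so a report might instead look for a sustained run of cold days or add a safety margin.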

Data

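To achieve the goal of this lab, we will need historical weather data. The necessary data is provided in the following files:

  • weather-train.parquet
  • weather-vtrain.parquet
  • weather-validation.parquet
  • weather-test.parquet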

Source

The Urbana weather data was collected using the Open-Meteo API. Specifically, the Historical Weather API was used.

The Historical Weather API is based on reanalysis datasets and uses a combination of weather stations, aircraft, buoy, radar, and satellite observations to create a comprehensive record of past weather conditions. These datasets are able to fill in gaps by using mathematical models to estimate the values of various weather variables. As a result, reanalysis datasets are able to provide detailed historical weather information for locations that may not have had weather stations nearby, such as rural areas or the open ocean.

Additional citations specific to the weather models used by the API can be found on the Open-Meteo website.

The Urbana data was accessed using:

  • Latitude: 40.1106
  • Longitude: -88.2073

On a map, this places the location almost exactly at Lincoln Square.

Open-Meteo provides excellent documentation for their APIs.

That documentation provides detailed information about how to use the API for the Urbana location. It will automatically generate Python code to make a request to the API and collect the results as a pandas data frame!
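As a rough sketch of what such a request looks like, the snippet below only builds a request URL and does not send it. The endpoint URL and the parameter names are assumptions that should be confirmed against the Open-Meteo documentation. The settings actually used to create the lab data (for example, temperature unit and time zone) are not stated here, so they are left out.

```python
from urllib.parse import urlencode

# Assumed endpoint for Open-Meteo's Historical Weather API; confirm it
# against the official documentation before relying on it.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_request_url(start_date, end_date):
    """Build (but do not send) a request URL for daily minimum temperature."""
    params = {
        "latitude": 40.1106,
        "longitude": -88.2073,
        "start_date": start_date,  # ISO 8601, e.g. "2016-01-01"
        "end_date": end_date,
        "daily": "temperature_2m_min",
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"


url = build_request_url("2016-01-01", "2016-12-31")
print(url)
```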

Data Dictionary

The provided data is not split randomly. The data is split by time, specifically based on years.

  • Train: 2016 - 2022
  • Validation-Train: 2016 - 2020
  • Validation: 2021 - 2022
  • Test: 2023
  • Production: 2024

This data is pre-split as we have not yet fully discussed data splitting, and the time element requires careful splitting. Also note that the validation-train plus validation data together make up the full train data.
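To illustrate what a split by year means, here is a minimal sketch on a synthetic data frame. The frame `weather` and its contents are made up for illustration. The provided files are already split, so this is not needed for the lab itself.

```python
import pandas as pd

# Synthetic stand-in for the lab data: one row per day, indexed by date.
dates = pd.date_range("2016-01-01", "2022-12-31", freq="D")
weather = pd.DataFrame(index=dates)
weather["year"] = weather.index.year

# Split on the year so that every validation day comes after every
# validation-train day.
vtrain = weather[weather["year"] <= 2020]
validation = weather[weather["year"] >= 2021]

print(len(vtrain), len(validation), len(weather))
```

Splitting on the year, rather than sampling rows at random, keeps future observations out of the data used to fit the model.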

The index of each data frame is the full date using the ISO 8601 standard. For example, 2020-07-04.

Response

temperature_2m_min

  • [float64] the minimum air temperature at 2 meters above ground for the day

Features

year

  • [int64] year, such as 2020

month

  • [int64] month, such as 10 for October

day

  • [int64] day of the month, for example 20 for January 20

day_of_year

  • [int64] day of the year, for example 100, which in non-leap years is April 10

For this lab, we require the use of only the year and day_of_year features.
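The relationship between a calendar date and day_of_year can be checked with the standard library. The helper names below are hypothetical and exist only for this sketch.

```python
from datetime import date, timedelta


def day_of_year(iso_date):
    """Return the 1-based day of the year for an ISO 8601 date string."""
    return date.fromisoformat(iso_date).timetuple().tm_yday


def date_from_day_of_year(year, day):
    """Return the calendar date for a 1-based day of the year."""
    return date(year, 1, 1) + timedelta(days=day - 1)


print(day_of_year("2020-07-04"))         # 186 (2020 is a leap year)
print(date_from_day_of_year(2023, 100))  # 2023-04-10
print(date_from_day_of_year(2024, 100))  # 2024-04-09 (leap year)
```

Because of leap days, the same day_of_year can fall on different calendar dates in different years, which matters when translating a predicted day back into a date for the report.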

Data in Python

To load the data in Python, use:

import pandas as pd
weather_train = pd.read_parquet(
    "https://cs307.org/lab/data/weather-train.parquet",
)
weather_vtrain = pd.read_parquet(
    "https://cs307.org/lab/data/weather-vtrain.parquet",
)
weather_validation = pd.read_parquet(
    "https://cs307.org/lab/data/weather-validation.parquet",
)
weather_test = pd.read_parquet(
    "https://cs307.org/lab/data/weather-test.parquet",
)

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = weather_train[["year", "day_of_year"]]
y_train = weather_train["temperature_2m_min"]

# create X and y for validation-train
X_vtrain = weather_vtrain[["year", "day_of_year"]]
y_vtrain = weather_vtrain["temperature_2m_min"]

# create X and y for validation
X_validation = weather_validation[["year", "day_of_year"]]
y_validation = weather_validation["temperature_2m_min"]

# create X and y for test
X_test = weather_test[["year", "day_of_year"]]
y_test = weather_test["temperature_2m_min"]

You can assume that within the autograder, similar processing is performed on the production data.

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
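The specific statistics requested are listed on PrairieLearn and are not reproduced here. As a generic sketch of the kind of calculation involved, the snippet below summarizes the response overall and by month on a tiny, made-up data frame named `toy`.

```python
import pandas as pd

# Tiny synthetic stand-in for weather_train; the values are made up.
toy = pd.DataFrame(
    {
        "month": [1, 1, 7, 7],
        "temperature_2m_min": [-5.0, -3.0, 18.0, 20.0],
    }
)

# Overall summary of the response.
print(toy["temperature_2m_min"].describe())

# Mean and standard deviation of the response within each month.
by_month = toy.groupby("month")["temperature_2m_min"].agg(["mean", "std"])
print(by_month)
```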

Models

For this lab you will select one model to submit to the autograder. You may use any modeling technique you’d like, so long as your model meets these requirements:

  • Your model must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with X_train and the same variable names and types.
    • In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a shape compatible with X_train and the same variable names and types.
  • Your model must have a fit method.
  • Your model must have a predict method.
  • Your model should be created with scikit-learn version 1.6.1 or newer.
  • Your model should be serialized with joblib version 1.4.2 or newer.
  • Your serialized model must be less than 5MB.
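The requirements above can be sanity-checked locally before submission. The following is a minimal sketch on synthetic data, not a recommended model: `KNeighborsRegressor` is an arbitrary choice, the file name `weather.joblib` is hypothetical, and treating 5MB as 5 million bytes is an assumption.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for X_train and y_train (values are made up).
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(
    {
        "year": np.repeat([2021, 2022], 365),
        "day_of_year": np.tile(np.arange(1, 366), 2),
    }
)
seasonal = 10 - 15 * np.cos(2 * np.pi * X_toy["day_of_year"] / 365)
y_toy = (seasonal + rng.normal(0, 2, size=len(X_toy))).rename("temperature_2m_min")

# KNeighborsRegressor is an arbitrary choice for illustration only.
mod = KNeighborsRegressor(n_neighbors=10)
mod.fit(X_toy, y_toy)

# Serialize, check the file size, then reload and predict.
path = os.path.join(tempfile.mkdtemp(), "weather.joblib")
joblib.dump(mod, path)
size_mb = os.path.getsize(path) / 1e6
loaded = joblib.load(path)
preds = loaded.predict(X_toy)

print(f"{size_mb:.3f} MB", preds.shape)
```

Reloading the serialized model and calling predict on a data frame with the same columns as the training features mimics what the autograder is described as doing.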

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To obtain the maximum points via the autograder, your model must achieve an RMSE lower than each of the following thresholds:

Test RMSE: 5.5
Production RMSE: 5.5
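For reference, RMSE (root mean squared error) is the square root of the average squared difference between the observed and predicted values. The sketch below uses a hypothetical helper named `rmse` with made-up numbers. Recent versions of scikit-learn also provide `sklearn.metrics.root_mean_squared_error`, which should give the same result.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up observed and predicted values: sqrt((9 + 16 + 0) / 3)
print(rmse([30.0, 32.0, 28.0], [33.0, 28.0, 28.0]))
```

RMSE is expressed in the same unit as the response, so a value of 5.5 corresponds to a typical prediction error of about 5.5 degrees.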

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy document.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.

Footnotes

  1. If you’ve never been, we highly recommend it!