import pandas as pd
Lab 02: Weather
For Lab 02, you will use weather data to develop a model that will predict the minimum temperature in Urbana, IL for a particular day of the year.
Background
Lincoln Square is home to both the Common Ground Food Co-Operative and the Market at the Square. Market at the Square is the Urbana farmers market.[1]
Because the weather in Urbana is not so nice year-round, the market is only held outdoors during reasonable weather. As winter approaches, some of the vendors continue in a more limited capacity, but indoors at Lincoln Square.
Scenario and Goal
Who are you?
- You work as staff for the City of Urbana. The current date is January 1, 2024.
What is your task?
- The Office of the Mayor has asked you to provide an estimate of when, toward the end of 2025, the market should move indoors as winter arrives.
Who are you writing for?
- To summarize your work, you will write a report for the Office of the Mayor, to be shared with the full City Council. Because these individuals are not familiar with machine learning, you will need to make your recommendation clear, without using too much ML jargon. The Office of the Mayor and the City Council are of course familiar with the Market at the Square and the freezing temperatures that arrive in Urbana during late Fall and early Winter at the end of the calendar year.
Data
To achieve the goal of this lab, we will need historical weather data. The necessary data is provided in the following files:
- Train Data: weather-train.parquet
- (Validation) Train Data: weather-vtrain.parquet
- Validation Data: weather-validation.parquet
- Test Data: weather-test.parquet
Source
The Urbana weather data was collected using the Open-Meteo API. Specifically, the Historical Weather API was used.
- Zippenfenig, P. (2023). Open-Meteo.com Weather API [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.7970649
The Historical Weather API is based on reanalysis datasets and uses a combination of weather stations, aircraft, buoy, radar, and satellite observations to create a comprehensive record of past weather conditions. These datasets are able to fill in gaps by using mathematical models to estimate the values of various weather variables. As a result, reanalysis datasets are able to provide detailed historical weather information for locations that may not have had weather stations nearby, such as rural areas or the open ocean.
Additional citations specific to the weather models used by the API can be found on the Open-Meteo website.
The Urbana data was accessed using:
- Latitude: 40.1106
- Longitude: -88.2073
On a map, this places the location almost exactly at Lincoln Square.
Open-Meteo provides excellent documentation for their APIs.
The documentation gives detailed information about how to use the API for the Urbana location. It will automatically generate Python code to make a request to the API and collect the results as a pandas data frame!
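As a rough illustration of what such a request looks like, the sketch below only builds a request URL for the coordinates listed above; it does not contact the API. The endpoint and parameter names are written from memory of the Open-Meteo Historical Weather API and should be treated as assumptions to check against the documentation. The helper name build_archive_url and the example dates are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Open-Meteo Historical Weather API; verify in the docs.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(start_date: str, end_date: str) -> str:
    """Build a request URL for the daily minimum temperature at Lincoln Square.

    This is a hypothetical helper: it only assembles the URL and sends nothing.
    """
    params = {
        "latitude": 40.1106,
        "longitude": -88.2073,
        "start_date": start_date,  # ISO 8601 date, for example "2016-01-01"
        "end_date": end_date,
        "daily": "temperature_2m_min",  # the response variable used in this lab
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"


url = build_archive_url("2016-01-01", "2016-12-31")
print(url)
```

For this lab you do not need to call the API yourself, since the data is provided as parquet files.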
Data Dictionary
The provided data is not split randomly. Instead, the data is split by time, specifically by year.
- Train: 2016 - 2022
- Validation-Train: 2016 - 2020
- Validation: 2021 - 2022
- Test: 2023
- Production: 2024
The index of each data frame is the full date using the ISO 8601 standard. For example, 2020-07-04.
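To make the idea of a time-based split concrete, here is a minimal sketch on synthetic data. The temperatures are random numbers, not the lab data, and storing the index as ISO 8601 strings is an assumption. The year boundaries follow the split described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data: one row per day, random temperatures.
dates = pd.date_range("2016-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(42)
weather = pd.DataFrame(
    {
        "year": dates.year.to_numpy(),
        "day_of_year": dates.dayofyear.to_numpy(),
        "temperature_2m_min": rng.normal(loc=5.0, scale=10.0, size=len(dates)),
    },
    index=dates.strftime("%Y-%m-%d"),  # ISO 8601 strings (assumed index type)
)

# Split by year rather than at random, so later years are never used
# to predict earlier ones.
train = weather[weather["year"] <= 2022]
vtrain = weather[weather["year"] <= 2020]
validation = weather[weather["year"].between(2021, 2022)]
test = weather[weather["year"] == 2023]

print(len(train), len(vtrain), len(validation), len(test))
```

Note that the validation-train and validation sets together make up the train set, which is why the final model can be refit on all of 2016 - 2022 before it is evaluated on 2023.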
Response
- temperature_2m_min [float64]: the minimum air temperature at 2 meters above ground for the day
Features
- year [int64]: year, such as 2020
- month [int64]: month, such as 10 for October
- day [int64]: day of the month, for example 20 for January 20
- day_of_year [int64]: day of the year, for example 100, which in non-leap years is April 10
For this lab, we require the use of only the year and day_of_year features.
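The features above can all be derived from the date itself. The following sketch shows one way to do so with pandas, using a few hypothetical dates and assuming the index holds ISO 8601 date strings. It also illustrates that the same day_of_year value can correspond to different calendar dates in leap and non-leap years.

```python
import pandas as pd

# Hypothetical dates; the index is assumed to hold ISO 8601 date strings.
index = pd.Index(["2020-07-04", "2021-04-10", "2020-04-09"])
dates = pd.to_datetime(index)

# Derive the calendar features described in the data dictionary.
features = pd.DataFrame(
    {
        "year": dates.year.to_numpy(),
        "month": dates.month.to_numpy(),
        "day": dates.day.to_numpy(),
        "day_of_year": dates.dayofyear.to_numpy(),
    },
    index=index,
)

print(features)
```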
Data in Python
To load the data in Python, use:
import pandas as pd

# load each split directly from the course website
weather_train = pd.read_parquet(
    "https://cs307.org/lab/data/weather-train.parquet",
)
weather_vtrain = pd.read_parquet(
    "https://cs307.org/lab/data/weather-vtrain.parquet",
)
weather_validation = pd.read_parquet(
    "https://cs307.org/lab/data/weather-validation.parquet",
)
weather_test = pd.read_parquet(
    "https://cs307.org/lab/data/weather-test.parquet",
)
Prepare Data for Machine Learning
Create the X and y variants of the data for use with sklearn:
# create X and y for train
X_train = weather_train[["year", "day_of_year"]]
y_train = weather_train["temperature_2m_min"]

# create X and y for validation-train
X_vtrain = weather_vtrain[["year", "day_of_year"]]
y_vtrain = weather_vtrain["temperature_2m_min"]

# create X and y for validation
X_validation = weather_validation[["year", "day_of_year"]]
y_validation = weather_validation["temperature_2m_min"]

# create X and y for test
X_test = weather_test[["year", "day_of_year"]]
y_test = weather_test["temperature_2m_min"]
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
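As a starting point for exploring the data, the sketch below computes overall and per-month summaries of the response. It runs on synthetic data with an artificial seasonal pattern, so the numbers it prints are meaningless for the lab. The specific statistics shown are an assumption and may differ from those requested on PrairieLearn.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data with an artificial seasonal cycle.
dates = pd.date_range("2016-01-01", "2022-12-31", freq="D")
rng = np.random.default_rng(0)
day_of_year = dates.dayofyear.to_numpy()
seasonal = 10.0 - 15.0 * np.cos(2 * np.pi * (day_of_year - 15) / 365.25)
demo = pd.DataFrame(
    {
        "year": dates.year.to_numpy(),
        "month": dates.month.to_numpy(),
        "temperature_2m_min": seasonal + rng.normal(scale=4.0, size=len(dates)),
    }
)

# Overall summary of the response.
overall = demo["temperature_2m_min"].agg(["mean", "std", "min", "max"])

# Summary of the response within each month of the year.
by_month = demo.groupby("month")["temperature_2m_min"].agg(["mean", "std"])

print(overall)
print(by_month)
```

With the real data, the same calls can be applied to weather_train in place of the synthetic frame.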
Models
For this lab you will select one model to submit to the autograder. You may use any modeling technique you’d like, so long as your model meets these requirements:
- Your model must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with and the same variable names and types as X_train.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a compatible shape with and the same variable names and types as X_train.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model should be created with scikit-learn version 1.6.1 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
- Your serialized model must be less than 5MB.
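The requirements above can be checked locally before submitting. The sketch below fits a k-nearest neighbors regressor on synthetic data, serializes it with joblib, checks the file size, and reloads it to call predict. The choice of model, the value of n_neighbors, and the filename weather.joblib are all illustrative assumptions, not a recommended solution.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in with the same column names as X_train and y_train.
rng = np.random.default_rng(1)
dates = pd.date_range("2016-01-01", "2022-12-31", freq="D")
X_demo = pd.DataFrame(
    {
        "year": dates.year.to_numpy(),
        "day_of_year": dates.dayofyear.to_numpy(),
    }
)
y_demo = pd.Series(
    10.0
    - 15.0 * np.cos(2 * np.pi * X_demo["day_of_year"] / 365.25)
    + rng.normal(scale=4.0, size=len(X_demo)),
    name="temperature_2m_min",
)

# Any estimator with fit and predict methods would do; KNN is only an example.
mod = KNeighborsRegressor(n_neighbors=25)
mod.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weather.joblib")  # hypothetical filename
    joblib.dump(mod, path)
    size_mb = os.path.getsize(path) / 1e6  # size of the serialized model in MB
    loaded = joblib.load(path)

# The reloaded model is used the same way the requirements describe.
preds = loaded.predict(X_demo.head())
print(f"{size_mb:.3f} MB", preds)
```

Models that store the training data, such as k-nearest neighbors, grow with the size of the data, so checking the serialized size is worthwhile.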
To obtain the maximum points via the autograder, your model must outperform the following metrics:
Test RMSE: 5.5
Production RMSE: 5.5
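For reference, RMSE is the square root of the mean squared difference between the observed and predicted values, and lower is better. The sketch below computes it for a small made-up example. The helper name rmse and the toy numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: the square root of the mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Toy values: the errors are 1, 0, and 2, so the squared errors are 1, 0, and 4.
y_true = np.array([0.0, 2.0, 4.0])
y_pred = np.array([1.0, 2.0, 2.0])

score = rmse(y_true, y_pred)  # sqrt(5 / 3)
print(score, score < 5.5)
```

With a fitted model, the same function can be applied as rmse(y_test, mod.predict(X_test)).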
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.
Footnotes
[1] If you’ve never been, we highly recommend it!