Lab 01: Urbana Weather
For Lab 01, you will use weather data to develop a model that will predict the minimum temperature in Urbana, IL for a particular day of the year.
Background
Lincoln Square is home to both the Common Ground Food Co-Operative and the Market at the Square. Market at the Square is the Urbana farmers' market.¹
Because the weather in Urbana is not so nice year-round, the market is only outdoors during reasonable weather. As winter approaches, some of the vendors continue in a more limited capacity, but indoors at Lincoln Square.
Scenario and Goal
You are the manager for the Market at the Square, the local Urbana Farmer’s Market. Each year, sometime in Autumn, the market moves from outdoors to indoors. You’d like to be able to reliably predict when to make the move, but well in advance, to give vendors certainty about when the change will take place, as not all vendors make the switch to indoors. You hope to find a model for the minimum daily temperature (as the market opens early in the morning, and vendors arrive even earlier) so that you can predict when it will be too cold to hold the market outdoors.
Data
To achieve the goal of this lab, we will need historical weather data. The necessary data is provided in the following files:
- Train Data:
weather-train.csv
- Validation-Train Data:
weather-vtrain.csv
- Validation Data:
weather-validation.csv
- Test Data:
weather-test.csv
Source
The Urbana Weather data was collected using the Open-Meteo API. Specifically, the Historical Weather API was used.
- Zippenfenig, P. (2023). Open-Meteo.com Weather API [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.7970649
The Historical Weather API is based on reanalysis datasets and uses a combination of weather station, aircraft, buoy, radar, and satellite observations to create a comprehensive record of past weather conditions. These datasets are able to fill in gaps by using mathematical models to estimate the values of various weather variables. As a result, reanalysis datasets are able to provide detailed historical weather information for locations that may not have had weather stations nearby, such as rural areas or the open ocean.
Additional citations specific to the weather models used by the API can be found on the Open-Meteo website.
The Urbana Weather data was accessed using:
- Latitude: 40.1106
- Longitude: -88.2073
On a map, this places the location almost exactly at Lincoln Square.
Open-Meteo provides excellent documentation on their APIs.
The above link will provide detailed information about how to use the API for the Urbana location. It will even automatically generate Python code to make a request to the API and collect the results as a pandas data frame!
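As a rough sketch of what such a request looks like, the following builds a query URL for the Historical Weather API using the Urbana coordinates. The endpoint and parameter names here are assumptions based on the Open-Meteo documentation; prefer the auto-generated code from the Open-Meteo site for anything you actually run.

```python
import urllib.parse

# parameter names below follow the Open-Meteo Historical Weather API docs;
# verify against the site's auto-generated code before relying on them
params = {
    "latitude": 40.1106,    # Lincoln Square, Urbana, IL
    "longitude": -88.2073,
    "start_date": "2016-01-01",
    "end_date": "2021-12-31",
    "daily": "temperature_2m_min",
    "timezone": "America/Chicago",
}
url = (
    "https://archive-api.open-meteo.com/v1/archive?"
    + urllib.parse.urlencode(params)
)
print(url)

# import requests
# daily = requests.get(url).json()["daily"]  # network call, not run here
```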
Data Dictionary
The data here is not split randomly. The different datasets are split according to time.
- Train: 2016 - 2021
- Validation-Train: 2016 - 2019
- Validation: 2020 - 2021
- Test: 2022
- Production: 2023
The index of each data frame is the full date using the ISO 8601 standard. For example, 2020-07-04.
Response

- `temperature_2m_min` [float64]: the minimum air temperature at 2 meters above ground for the day

Features

- `year` [int64]: year, such as 2020
- `month` [int64]: month, such as 10 for October
- `day` [int64]: day of the month, for example 20 for January 20
- `day_of_year` [int64]: day of the year, for example 100, which in non-leap years is April 10

For this lab, we require the use of only the `year` and `day_of_year` features.
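Because the index is a proper date, all of these feature columns can also be recovered directly from it with pandas. A quick sketch with two hard-coded example dates:

```python
import pandas as pd

# parse two ISO 8601 dates and read the calendar components off the index
dates = pd.to_datetime(["2020-07-04", "2021-12-31"])
print(dates.year.tolist())       # [2020, 2021]
print(dates.dayofyear.tolist())  # [186, 365] (2020 is a leap year)
```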
Data in Python
To load the data in Python, use:
```python
import pandas as pd

weather_train = pd.read_csv(
    "https://cs307.org/lab-01/data/weather-train.csv",
    index_col="date",
    parse_dates=True,
)
weather_vtrain = pd.read_csv(
    "https://cs307.org/lab-01/data/weather-vtrain.csv",
    index_col="date",
    parse_dates=True,
)
weather_validation = pd.read_csv(
    "https://cs307.org/lab-01/data/weather-validation.csv",
    index_col="date",
    parse_dates=True,
)
weather_test = pd.read_csv(
    "https://cs307.org/lab-01/data/weather-test.csv",
    index_col="date",
    parse_dates=True,
)
```
weather_train

| date | temperature_2m_min | year | month | day | day_of_year |
|---|---|---|---|---|---|
| 2016-01-01 | -4.2715 | 2016 | 1 | 1 | 1 |
| 2016-01-02 | -3.8715 | 2016 | 1 | 2 | 2 |
| 2016-01-03 | -4.4715 | 2016 | 1 | 3 | 3 |
| 2016-01-04 | -3.0215 | 2016 | 1 | 4 | 4 |
| 2016-01-05 | -5.7715 | 2016 | 1 | 5 | 5 |
| ... | ... | ... | ... | ... | ... |
| 2021-12-27 | 6.9980 | 2021 | 12 | 27 | 361 |
| 2021-12-28 | 1.7980 | 2021 | 12 | 28 | 362 |
| 2021-12-29 | 2.1980 | 2021 | 12 | 29 | 363 |
| 2021-12-30 | 0.5980 | 2021 | 12 | 30 | 364 |
| 2021-12-31 | 4.7480 | 2021 | 12 | 31 | 365 |

2192 rows × 5 columns
Prepare Data for Machine Learning
Create the `X` and `y` variants of the data for use with `sklearn`:

```python
# create X and y for train
X_train = weather_train[["year", "day_of_year"]]
y_train = weather_train["temperature_2m_min"]

# create X and y for validation-train
X_vtrain = weather_vtrain[["year", "day_of_year"]]
y_vtrain = weather_vtrain["temperature_2m_min"]

# create X and y for validation
X_validation = weather_validation[["year", "day_of_year"]]
y_validation = weather_validation["temperature_2m_min"]

# create X and y for test
X_test = weather_test[["year", "day_of_year"]]
y_test = weather_test["temperature_2m_min"]
```
You can assume that within the autograder, similar processing is performed on the production data.
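Since the same split is repeated four times above (and again on the production data), you may find it convenient to wrap it in a helper. The function below is a hypothetical convenience, not something the autograder requires; it is shown with a tiny stand-in frame so it runs on its own.

```python
import pandas as pd

def make_xy(df):
    """Split a weather frame into (features, response)."""
    X = df[["year", "day_of_year"]]
    y = df["temperature_2m_min"]
    return X, y

# tiny stand-in frame using the lab's column names
demo = pd.DataFrame({
    "temperature_2m_min": [-4.2715, -3.8715],
    "year": [2016, 2016],
    "day_of_year": [1, 2],
})
X_demo, y_demo = make_xy(demo)
print(X_demo.shape, y_demo.shape)  # (2, 2) (2,)
```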
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
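One convenient starting point is `Series.describe`, which reports the count, mean, standard deviation, and quartiles in one call. The sketch below uses a synthetic stand-in series so it is self-contained; with the real data you would call `y_train.describe()` directly (and check PrairieLearn for exactly which statistics are requested).

```python
import numpy as np
import pandas as pd

# synthetic stand-in for y_train; the values here are made up
rng = np.random.default_rng(307)
y = pd.Series(rng.normal(10.0, 11.0, size=365), name="temperature_2m_min")

stats = y.describe()  # count, mean, std, min, quartiles, max
print(stats)
```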
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of `X_train` and `y_train` should not be changed.
- In the autograder, we will call `mod.predict(X_test)` on your model, where your model is loaded as `mod` and `X_test` has a compatible shape with, and the same variable names and types as, `X_train`.
- In the autograder, we will call `mod.predict(X_prod)` on your model, where your model is loaded as `mod` and `X_prod` has a compatible shape with, and the same variable names and types as, `X_train`.
- Your model must have a `fit` method.
- Your model must have a `predict` method.
- Your model should be created with `scikit-learn` version `1.5.1` or newer.
- Your model should be serialized with `joblib` version `1.4.2` or newer.
  - Your serialized model must be less than 5MB.
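To make the rules concrete, here is a minimal sketch of a model object that satisfies them: it has `fit` and `predict` methods and accepts a two-column frame named like `X_train`. KNN and `n_neighbors=25` are placeholders, not a recommendation, and the synthetic data below stands in for the real weather frames.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in for the weather data; with the real data,
# fit on X_train / y_train instead
rng = np.random.default_rng(307)
X = pd.DataFrame({
    "year": rng.integers(2016, 2022, size=200),
    "day_of_year": rng.integers(1, 366, size=200),
})
y = pd.Series(rng.normal(10.0, 11.0, size=200))

mod = KNeighborsRegressor(n_neighbors=25)  # n_neighbors is a placeholder
mod.fit(X, y)
print(mod.predict(X.head()).shape)  # one prediction per input row: (5,)
```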
To obtain the maximum points via the autograder, your model must achieve an RMSE at or below:

- Test RMSE: 5.5
- Production RMSE: 5.5
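You cannot compute the test or production RMSE yourself before submitting, but you can estimate it on the validation split. A small helper (written here from the usual RMSE definition, as a sketch) that you could apply as `rmse(y_validation, mod.predict(X_validation))`:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([0.0, 0.0], [3.0, 4.0]))  # sqrt((9 + 16) / 2) ≈ 3.536
```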
Model Persistence
To save your model for submission to the autograder, use the `dump` function from the `joblib` library. Check PrairieLearn for the filename that the autograder expects for this lab.

```python
from joblib import dump

dump(mod, "filename.joblib")
```
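Before submitting, it is worth sanity-checking that the serialized file loads back correctly and stays under the size limit. The round trip below uses a placeholder dictionary standing in for your fitted model `mod`, and a temporary directory so it runs anywhere:

```python
import os
import tempfile
from joblib import dump, load

# placeholder object standing in for your fitted model `mod`
obj = {"note": "stand-in model"}

path = os.path.join(tempfile.mkdtemp(), "filename.joblib")
dump(obj, path)

# what we load back should match what we saved
assert load(path) == obj
print(os.path.getsize(path) < 5_000_000)  # True: under the 5MB limit
```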
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.
Additional discussion prompts:
- Does the overall strategy here seem appropriate? Do you have any general weather knowledge that suggests an obvious flaw here?
- Be sure you have read the data background, paying attention to how the data was collected and split.
- Assuming you used KNN, does distance make sense here? What is the distance between two dates in time for this data? Does this actually make sense?
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
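When thinking through the distance prompt, one concrete computation may help (plain arithmetic, no modeling library assumed): with `day_of_year` treated as an ordinary number, dates that are calendar-adjacent can be numerically far apart.

```python
# day_of_year values for Dec 31 (non-leap year) and Jan 1
dec_31 = 365
jan_1 = 1

naive_gap = abs(dec_31 - jan_1)
print(naive_gap)  # 364, even though the dates are one day apart

# a wrap-around ("circular") distance instead treats the year as a cycle
print(min(naive_gap, 365 - naive_gap))  # 1
```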
Template Notebook
Submission
On Canvas, be sure to submit both your source `.ipynb` file and a rendered `.html` version of the report.
Footnotes

1. If you’ve never been, we highly recommend it!