from joblib import dump
dump(model_object, "filename.joblib")
Lab Policy
The Boring But Important Details
There will be a total of ten labs in CS 307. Each lab will consist of two separate but related assignments:
- Lab Model: You will develop a model and submit it to PrairieLearn.
- Lab Report: You will write a report and submit it to Canvas.
Each lab will involve developing machine learning models for real world data.
Lab Model
The model portion of the lab will consist of two questions on PrairieLearn. It will be graded out of 10 points. Because it is autograded on PrairieLearn, it will allow for buffer points.
The Summary Statistics question will ask you to calculate several numeric summaries of the training data to get you familiar with the lab data.
The Models question will autograde the model that you are asked to develop as a part of the lab.
Model Submission
To save your models for submission to the autograder, use the dump function from the joblib library. This process of persisting a model to disk is called serialization.
The autograder will only accept a particular filename. This filename will always be given on PrairieLearn and in the lab. Models submitted to the autograder must be less than 5MB on disk.
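As a rough sketch of the save-and-check workflow described above, the snippet below serializes an object with dump, checks its size on disk, and loads it back. The dictionary standing in for a model, the filename, and the reading of 5MB as 5,000,000 bytes are all assumptions for illustration. In a real lab, use your fitted model and the filename given on PrairieLearn.

```python
import os
import tempfile

from joblib import dump, load

# Hypothetical stand-in for a fitted model; any picklable object is
# serialized the same way.
model_object = {"intercept": 1.5, "coefficients": [0.2, -0.7, 3.1]}

# Assumption: "5MB" is read as 5,000,000 bytes. The autograder's exact
# limit may be computed differently.
MAX_BYTES = 5_000_000

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical filename; the required one is given on PrairieLearn.
    path = os.path.join(tmp_dir, "filename.joblib")
    dump(model_object, path)  # serialize the object to disk

    size_bytes = os.path.getsize(path)
    under_limit = size_bytes < MAX_BYTES

    restored = load(path)  # deserialize to confirm the file is readable

print(f"Size on disk: {size_bytes} bytes, under limit: {under_limit}")
```

If a model turns out to be too large, dump also accepts a compress argument (for example, compress=3) that may shrink the file at the cost of slower saving and loading.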
In general, you will have access to both a train and test set. We will also test your model against additional holdout data, which we will call the production set.
Lab Report
In addition to simply training models, you will also write a lab report using the IMRAD structure. A template Jupyter notebook will be provided.
IMRAD Format
While we are requiring the IMRAD format, this does not imply that you need to write a full academic paper. Stick to the template provided and generally try to be concise.[1] You are authorized to plagiarize from the lab document that describes each lab scenario and the associated data.
Introduction
The introduction section should largely state the purpose of the report. That is, it should explain the why and the goal of the report. It should very briefly mention what data and models will be used.
Methods
The methods section should describe what you did. We will break the methods section into two subsections.
Data
The data section should do three things:
- Describe the available data
- Calculate any relevant summary statistics
- Include at least one relevant visualization
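As one possible illustration of the summary statistics step, the sketch below uses pandas on a small made-up data frame. The column names and values are hypothetical; each lab provides its own training data and asks for its own summaries.

```python
import pandas as pd

# Made-up data for illustration only; real labs provide their own data.
train = pd.DataFrame(
    {
        "temperature_f": [31.0, 45.0, 52.0, 68.0, 74.0],
        "season": ["winter", "spring", "spring", "summer", "summer"],
    }
)

n_rows, n_cols = train.shape
mean_temp = train["temperature_f"].mean()
std_temp = train["temperature_f"].std()  # sample standard deviation
season_counts = train["season"].value_counts()

print(f"Rows: {n_rows}, columns: {n_cols}")
print(f"Mean temperature: {mean_temp:.1f}")
print(f"Standard deviation: {std_temp:.1f}")
print(season_counts)
```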
Models
The models section should describe the modeling that was performed. When writing, you should not simply state what each line of your Python code does. Instead, you should describe the modeling as if you were describing it to another person.
This section will also collect the code used to train your models.
Results
The results section should plainly state the results, which will often be test metrics that evaluate the performance of your models, but you may certainly consider other statistics or visualizations.
Discussion
Be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! Discuss any limitations or potential improvements. Additionally, include responses to any discussion prompts stated in the lab document.
The discussion section is by far the most important, both in general, and for your lab grade. It should be given the most consideration, and is likely (but not required) to be the longest section.
Report Submission
After you complete the lab notebook, we recommend the following steps:
- Clear all output.
- Restart the Python kernel.
- Run all cells.
- Preview (render) the notebook.[3]
Note that each of these corresponds to a button in VSCode when editing a Jupyter Notebook. The preview button may require first clicking the three dots to see more options.
The preview (render) step requires Quarto CLI and the Quarto VSCode Extension. Installing these will allow you to render your Jupyter Notebook to a .html file using Quarto. This has a number of advantages over using Jupyter’s export feature.
Following these steps will ensure that once you have submitted, we will very, very likely be able to reproduce your work.
Once you’re ready to submit, head to the relevant lab on Canvas. You are required to submit two files:
- lab-xx.ipynb
- lab-xx.html
Here xx should be the two-digit lab number. For example, with Lab 01 you will submit:
- lab-01.ipynb
- lab-01.html
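The naming pattern above can be sketched in a few lines of Python. The helper name report_filenames is hypothetical and only illustrates the zero-padded, two-digit convention.

```python
def report_filenames(lab_number: int) -> tuple[str, str]:
    """Return the expected (.ipynb, .html) filenames for a lab number."""
    stem = f"lab-{lab_number:02d}"  # zero-pad to two digits
    return f"{stem}.ipynb", f"{stem}.html"


notebook_name, html_name = report_filenames(1)
print(notebook_name, html_name)
```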
Late Submissions
Unlike other course activities, lab reports are human graded, so no buffer points will be available. Instead, reports may be submitted late, with a 10% reduction per day.
Report submission will allow for unlimited attempts. However, be aware, the human graders will grade whichever version was most recently submitted at the time they choose to grade, which can be any time after the deadline. Importantly, if you submit one version before the deadline, and another after the deadline, they will grade the late version and you will be assessed a late penalty.
Once a grader has graded a report, you may not submit again, even if there are late days remaining. We do not recommend making a submission you are not willing to have graded.
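To make the late penalty arithmetic concrete, here is a small sketch. The policy above only states "a 10% reduction per day", so several details are assumptions: that the reduction applies to the earned score, that it accumulates additively, that any started day counts as a full day, and that the score cannot drop below zero. The function name is hypothetical.

```python
import math


def late_adjusted_score(
    raw_score: float, days_late: float, rate: float = 0.10
) -> float:
    """Apply a per-day late reduction to a raw report score.

    Assumptions (not confirmed by the policy): the reduction is a fraction
    of the earned score, accumulates additively, counts any started day as
    a full day, and never pushes the score below zero.
    """
    full_days = max(0, math.ceil(days_late))
    multiplier = max(0.0, 1.0 - rate * full_days)
    return raw_score * multiplier


on_time = late_adjusted_score(8.0, 0)
two_days_late = late_adjusted_score(8.0, 2)
print(f"On time: {on_time:.2f}, two days late: {two_days_late:.2f}")
```

Under these assumptions, a report that would have earned 8 of 10 points loses 20% of that score when submitted two days late.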
Grading Rubric
Lab Reports will be graded on Canvas out of a possible 10 points. Each of the 10 points will have its own rubric item. Each rubric item will be assigned a value of 0, 0.5, or 1 corresponding to:
- No issues: 1
- Minor issues: 0.5
- Major issues: 0
The ten rubric items are:
- Are both a rendered .html file and a source .ipynb file submitted?
- Are both the source and rendered Jupyter notebook, including the code contained in them, well-formatted?
  - Is markdown used correctly?
  - Does the markdown render as expected?
  - Does code follow PEP 8? While we do not expect students to be code style experts, there are some basics we would like you to follow:
    - No blank lines at the start of cells. No more than one blank line at the end of a cell.
    - Spaces around binary operators, but no spaces around the = used to pass keyword arguments to functions.
- Does the introduction reasonably introduce the scenario?
  - Can a reader unfamiliar with CS 307 and the specific lab understand why a model is being developed?
- Does the methods section reasonably describe the data and methods used?
- Is a well-formatted visualization included in the data subsection of the methods section (or in the results section)? At minimum, every graphic should include:
  - A title that uses Title Case.
  - A manually labeled x-axis using Title Case and including units if necessary.
  - A manually labeled y-axis using Title Case and including units if necessary.
  - A legend if plotting multiple categories of things.
- Does the results section provide a reasonable, probably numeric, summary of the model performance?
- Is a conclusion stated in the discussion section that takes into account the lab scenario?
  - Specifically, you must state whether or not you would use the model in practice.
- Is a reasonable justification given for the stated conclusion?
  - Answer as if your job depends on it. In the future, that might be the case!
  - Using a single numeric metric is wholly insufficient, most importantly because it lacks context. You should give serious consideration to what errors can be made by your model, and what the consequences of those errors could be.
- Is each of the additional discussion prompts answered?
  - As with the lab report itself, do not simply answer these directly. Instead, weave the answers into the narrative of your discussion.
- Throughout the discussion section, are course concepts used correctly and appropriately?
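As an illustration of the code style and visualization items in the rubric, the sketch below draws a small matplotlib figure with a Title Case title, manually labeled axes with units, and a legend. The data and labels are made up, and the choice of matplotlib is an assumption; any plotting library that meets the rubric should do.

```python
import matplotlib.pyplot as plt

# Made-up data for illustration only.
hours = [0, 6, 12, 18, 24]
urbana_temps_c = [-1, -2, 5, 3, 0]
chicago_temps_c = [-3, -4, 2, 1, -2]

# Spaces around binary operators such as *, /, and +.
urbana_temps_f = [c * 9 / 5 + 32 for c in urbana_temps_c]
chicago_temps_f = [c * 9 / 5 + 32 for c in chicago_temps_c]

# No spaces around = when passing keyword arguments.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(hours, urbana_temps_f, marker="o", label="Urbana")
ax.plot(hours, chicago_temps_f, marker="s", label="Chicago")

ax.set_title("Daily Temperature Readings")
ax.set_xlabel("Time of Day (Hours)")
ax.set_ylabel("Temperature (°F)")
ax.legend(title="City")
```

In a notebook, the figure is typically displayed when the cell finishes running, so no explicit call to save or show it is included here.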
Lab Topics
The following is a tentative list of lab topics.
- Lab 00: Using Football (NFL) Data to Predict 4th Down Conversions
- Lab 01: Using Weather Data to Predict Urbana Temperature
- Lab 02: Using Banking Customer Data to Predict Credit Ratings
- Lab 03: Using Baseball (MLB) Data to Classify Pitch Types
- Lab 04: Using Wine Data to Create Wine Ratings
- Lab 05: Using Genetic (Gene Expression) Data to Detect Disease
- Lab 06: Using Financial Transaction Data to Detect Fraud
- Lab 07: Using Text (Airline Tweets) Data for Sentiment Analysis
- Lab 08: Using Basketball (WNBA) Data to Grade Shot Selection
- Lab 09: Using Real Estate (Airbnb) Data to Predict Rental Prices
- Lab 10: Using Image Data to Classify Clothing
Footnotes
[1] You are not Charles Dickens and we are not paying you by the word.
[2] This is Midwestern for “yes” but enthusiastically.
[3] Importantly, this is not the export that Jupyter uses by default.