Lab Policy

Models Meet Data

Introduction

Labs in CS 307 are split into two related assignments.[1]

  • Lab Model: You will develop a model and submit it to PrairieLearn.
  • Lab Report: You will write a report and submit it to Canvas.

Each lab will involve developing machine learning models for a real-world situation using real-world data.

  • The model you submit will be graded based on its performance.
  • The report you submit will be graded based on its ability to communicate the purpose, performance, and usability of the model.

Lab Model

The model assignment of the lab will consist of two questions on PrairieLearn.

  • The Summary Statistics question will ask you to calculate several numeric summaries of the training data.
  • The Model question will autograde a model that you are asked to develop.

Model Requirements

When developing a model for labs, you may use any modeling techniques you’d like, so long as your model meets these requirements:

  • Your model must start from the provided training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your submitted model, where your model is loaded as mod and X_test has a shape compatible with the provided X_train, along with the same variable names and types.
    • In the autograder, we will call mod.predict(X_prod) on your submitted model, where your model is loaded as mod and X_prod has a shape compatible with the provided X_train, along with the same variable names and types.
    • If preprocessing is necessary, it should be included in your model via a pipeline.
  • Your model must have a fit method.
  • Your model must have a predict method.
  • If the lab is a classification task, your model must have a predict_proba method.
  • Your model must be developed using Python 3.13.1 or newer.
  • Your model must be created with scikit-learn version 1.7.1 or newer.
  • Your model must be serialized with joblib version 1.5.2 or newer.
  • Your serialized model must be less than 5MB.
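
The interface requirements above can be sketched as follows, assuming a classification lab with numeric features. The synthetic data from make_classification and the choice of StandardScaler with LogisticRegression are illustrative assumptions, not the data or the expected model for any actual lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the provided training and test data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=307)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=307)

# Preprocessing lives inside the pipeline, so X_train itself is never modified.
mod = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("classify", LogisticRegression()),
    ]
)
mod.fit(X_train, y_train)

# The fitted pipeline exposes the methods the requirements list.
predictions = mod.predict(X_test)
probabilities = mod.predict_proba(X_test)
```

Because the scaler is a step of the pipeline, calling mod.predict on raw data with the same columns applies the same preprocessing automatically.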

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To ensure that your model is developed using the correct version of Python and Python packages, recall the Computing Policy document. In particular, if you use the provided pyproject.toml for CS 307, those requirements should automatically be met!

Model Submission

To save your models for submission to the autograder, use the dump function from the joblib library. This process of persisting a model to disk is called serialization.

from joblib import dump
dump(model_object, "filename.joblib")

The autograder will only accept a particular filename. Models submitted to the autograder must be less than 5MB on disk. For particularly large models, you may use the compress parameter of the dump function to reduce the size of the model when written to disk.
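
As a rough sketch of checking the size limit locally, the snippet below writes a compressed model, measures the file, and reloads it. The DummyClassifier is a hypothetical stand-in for your fitted model, the temporary directory is only for illustration, and interpreting 5MB as 5 * 1024 * 1024 bytes is an assumption.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for your fitted model.
model_object = DummyClassifier(strategy="most_frequent")
model_object.fit([[0], [1], [2]], [0, 1, 1])

# Assumed interpretation of the 5MB limit.
SIZE_LIMIT_BYTES = 5 * 1024 * 1024

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filename.joblib")
    # compress takes an integer from 0 to 9; higher is smaller but slower.
    dump(model_object, path, compress=3)
    size_bytes = os.path.getsize(path)
    # Reload the file to confirm that the serialized model still predicts.
    reloaded = load(path)

fits_limit = size_bytes < SIZE_LIMIT_BYTES
```

Reloading the file and calling predict on it mirrors, at least roughly, what happens to a submitted model, so it can catch serialization problems before a submission is spent.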

There will be a very non-trivial timeout between possible model submissions to the autograder. Do not expect to make multiple submissions near a deadline. More than anything else, the long timeout should encourage you to seriously consider model validation before submission.
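
One way to validate a model before spending a submission is cross-validation on the training data. The following is a minimal sketch, assuming a regression lab; the synthetic data, the Ridge model, and the RMSE metric are illustrative choices rather than anything a lab prescribes.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the provided training data (assumption).
X_train, y_train = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=307
)

candidate = Ridge(alpha=1.0)

# Five-fold cross-validation; scikit-learn reports errors as negative scores.
scores = cross_val_score(
    candidate,
    X_train,
    y_train,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
cv_rmse = -scores.mean()
```

Comparing cv_rmse across candidate models gives an estimate of performance that does not depend on the autograder.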

Submissions for the summary statistics question will have a more modest timeout.

In general, you will have access to both a train and test set. We will also evaluate your model with additional holdout data, which we will call the production set. You will not have access to the production data.

Lab Report

In addition to simply developing models, you will also write a lab report using the IMRAD structure.

Template Notebook

We recommend using the following template notebook to start each of your reports:

The template notebook contains code cells with comments that suggest additional organization beyond the provided headings for the IMRAD format. You are free to delete or add code cells as needed.

The second cell of the notebook contains a YAML header that Quarto will use when you render your document.

---
embed-resources: true
echo: false
---

Do not modify this cell. If you are not using the template, you must include this YAML header in your document.

IMRAD Format

While we require the IMRAD format, that does not imply that you need to write an academic paper. Stick to the template provided and generally try to be concise.[2] You are authorized to plagiarize from the lab instructions that describe the lab scenario and associated data.

In general, when writing your report, write as if the lab prompt did not exist, and assume the reader is wholly unfamiliar with CS 307 or the assignment you are completing. They will have some familiarity with the domain of the problem depending on the given background and goal.

Is this a non-trivial amount of “extra” work? Yes. Is it worth it? You betcha![3]

Introduction

The introduction section should state the purpose of the report. It should explain the why and the goal of the report. It should very briefly mention what data and models will be used.

Methods

The methods section should describe what you did and how you did it. We will break the methods section into two subsections.

Data

The data section should do three things:

  • Describe the available data
  • Calculate and report any relevant summary statistics
  • Include at least one relevant visualization

To ensure that you have properly described the data, you should include a full data dictionary.

Modeling

The modeling section should describe the modeling procedures that were performed. You should not simply state what each line of your Python code does. Instead, you should describe the modeling as if you were describing it to another person.

Results

The results section should plainly state the results, which will often be test metrics that evaluate the performance of your models.
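
As a hedged illustration of computing test metrics, the sketch below evaluates a classifier on a held-out test set. The synthetic data, the LogisticRegression model, and the particular metrics are assumptions for illustration; each lab will have its own data and relevant metrics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the provided train and test data (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=307)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=307)

mod = LogisticRegression().fit(X_train, y_train)
y_pred = mod.predict(X_test)

# Several metrics together give more context than a single number.
test_metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}

# Rows are true classes, columns are predicted classes.
errors = confusion_matrix(y_test, y_pred)
```

The confusion matrix separates the kinds of errors the model makes, which helps when weighing the consequences of each kind of error.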

You must also include one visualization in the results (or discussion) section. This visualization should help communicate the performance or usability of your chosen model.

Do not report or make any decisions based on the production information reported in the autograder. You are writing this report to largely communicate if you would put your model in production or not. You wouldn’t have production metrics before putting the model in production! The production data is used partially to illustrate “making predictions for new data” but also to prevent cheating to obtain required test metrics.

Discussion

Be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab!

Specifically, if you choose to put your model into practice:

  • What benefit does the model provide?
  • What limitations should be considered?

Or, if you choose to not put your model into practice:

  • What risks are avoided by not using the model?
  • What improvements would be necessary to consider the model for usage?

The discussion section is by far the most important, both in general, and for your lab grade. It should be given the most consideration, and is likely (but not required) to be the longest section.

Report Submission

After you complete your lab notebook, we recommend the following steps:

  1. Clear all output.
  2. Restart the Python kernel.
  3. Run all cells.
  4. Save the notebook.
  5. Render the notebook with Quarto.[4]

To render a notebook using Quarto, we recommend the following command, substituting the correct lab number:

uv run quarto preview lab-00.ipynb

Be sure that you have opened your CS 307 folder in VSCode, otherwise this command may not work as expected.

Following these steps will ensure that once you have submitted, we will very, very likely be able to reproduce your work.

Before submission, you should open the HTML file you created in a web browser and verify that it looks correct.

Once you’re ready to submit, head to the relevant lab on Canvas. You are required to submit two files:

  1. lab-xx.ipynb
  2. lab-xx.html

Here xx should be the two-digit lab number. For example, for Lab 01 you will submit:

  1. lab-01.ipynb
  2. lab-01.html
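
The naming convention above can be checked before uploading. This is a small sketch; the check_submission helper and its pattern are hypothetical, inferred from the filenames shown here, and are not part of any course tooling.

```python
import re

# Pattern inferred from the convention above: lab-xx.ipynb and lab-xx.html.
LAB_FILE_PATTERN = re.compile(r"lab-(\d{2})\.(ipynb|html)")


def check_submission(filenames):
    """Return True for exactly one notebook and one report from the same lab."""
    matches = [LAB_FILE_PATTERN.fullmatch(name) for name in filenames]
    if len(matches) != 2 or not all(matches):
        return False
    lab_numbers = {match.group(1) for match in matches}
    extensions = {match.group(2) for match in matches}
    return len(lab_numbers) == 1 and extensions == {"ipynb", "html"}
```

For example, check_submission(["lab-01.ipynb", "lab-01.html"]) returns True, while a pair such as lab-1.ipynb and lab-1.html is rejected because the lab number is not two digits.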

After submitting to Canvas, please spend an extra minute to double check that your submission was accepted!

Grading Rubric

Lab Reports will be graded on Canvas out of a possible 15 points. Each of the 15 points will have its own rubric item. Each rubric item will be assigned a possible value of 0, 0.5, or 1 corresponding to:

  • No issues: 1
  • Minor issues: 0.5
  • Major issues: 0

Rubric Items

  1. Is the source .ipynb notebook submitted?
  2. Is a rendered .html report submitted?
  3. Is the .html file properly rendered via Quarto?
    • No points will be granted if the file is rendered via Jupyter.
  4. Are both the source notebook and rendered report, including the code contained in them, well-formatted?
    • Is markdown used correctly?
    • Does the markdown render as expected?
    • Are all warnings and messages suppressed from the rendered report?
    • Is code mostly hidden from the rendered report, except where truly useful for narrative or explanation?
    • Does code follow PEP 8? While we do not expect students to be code style experts, there are some very basics we would like you to follow:
      • No blank lines at the start of cells. No more than one blank line at the end of a cell.
      • Spaces around binary operators, but no spaces around = when passing keyword arguments to functions.
  5. Does the report have a title?
    • Does the title use (a reasonable variant of) Title Case?
  6. Does the introduction reasonably introduce the scenario?
    • Can a reader unfamiliar with CS 307 and the specific lab understand why a model is being developed?
  7. Does the methods section reasonably describe the data used?
    • Is a data dictionary, describing the target and each feature, included?
  8. Does the methods section reasonably describe model development?
    • Include information on models considered, parameters considered, tuning and selection procedures, and any other methods used during model development.
  9. Is a well-formatted exploratory visualization included in the data subsection of the methods section?
    • Does the visualization provide some useful insight that informs modeling or interpretation?
    • At minimum, a well-formatted visualization should include:
      • A manually labeled \(x\)-axis using Title Case, including units if necessary.
      • A manually labeled \(y\)-axis using Title Case, including units if necessary.
      • A legend if plotting multiple categories of things.
      • A figure caption created using Quarto that describes the visualization.
  10. Does the results section provide a reasonable summary of the selected model’s performance?
  11. Is a well-formatted summary figure included in the results (or discussion) section?
    • Does the figure provide some insight into the performance or usability of the model?
    • At minimum, a well-formatted visualization should include:
      • A manually labeled \(x\)-axis using Title Case, including units if necessary.
      • A manually labeled \(y\)-axis using Title Case, including units if necessary.
      • A legend if plotting multiple categories of things.
      • A figure caption created using Quarto that describes the visualization.
  12. Is a conclusion stated in the discussion section?
    • Specifically, you must explicitly state whether or not you would use the model in practice.
  13. Does the conclusion have a reasonable justification?
    • Do the conclusion and justification consider the lab scenario?
    • Answer as if your job depends on it. In the future, that might be the case!
    • Using a single numeric metric is wholly insufficient, most importantly because it lacks context. You should give serious consideration to what errors can be made by your model, and what the consequences of those errors could be.
  14. Are the specifics of the conclusion included in the discussion?
    • Are the benefits and limitations discussed if you choose to use the model?
    • Are the risks and improvements discussed if you choose to not use the model?
  15. Throughout the discussion section, are course concepts used correctly and appropriately?
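
The visualization requirements in rubric items 9 and 11 can be sketched as a single notebook cell. The data here is randomly generated purely for illustration, and the variable names, labels, and caption are hypothetical. The two leading comment lines are Quarto cell options that attach a label and caption when the notebook is rendered.

```python
#| label: fig-price-by-area
#| fig-cap: "Sale price against living area for two made-up neighborhoods."
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(307)
fig, ax = plt.subplots()

# Randomly generated illustrative values, not data from any lab.
for name, offset in [("North", 50), ("South", 20)]:
    area = rng.uniform(50, 250, size=20)
    price = offset + 1.5 * area + rng.normal(0, 20, size=20)
    ax.scatter(area, price, label=name)

# Manually labeled axes in Title Case, with units.
ax.set_xlabel("Living Area (Square Meters)")
ax.set_ylabel("Sale Price (Thousands of Dollars)")

# A legend, since multiple categories are plotted.
ax.legend(title="Neighborhood")
```

The cell also follows the style basics from item 4: spaces around binary operators such as * and +, and no spaces around = in keyword arguments such as label=name.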

Footnotes

  1. The last lab, Lab 06, will have only the model portion.

  2. You are not Charles Dickens and we are not paying you by the word.

  3. This is Midwestern for “yes” but enthusiastically.

  4. Importantly, this is not the export that Jupyter uses by default.