Lab 05: Gene Expression

Scenario: For this lab, we are downplaying the scenario, as you will not need to write a report, and the data is optimistically simulated. Your main focus in this lab is tuning a high-performing model. The features can be viewed as gene expression levels and the response as age, so you can imagine using gene expression to predict age. For a real-world example of this scenario:

Also, the following review paper shows an overview of the application of machine learning techniques in genomics:

Yes, you read that correctly: there is no “real” report to write for this lab. For grading purposes, however, you will still submit a blank template document. This way, we hope you can complete the autograded portion of the lab and simply be done with the lab before Spring Break!

Goal

The goal of this lab is to create a regression model that predicts the age of a human individual given gene expression data.

Data

This lab will use data simulated specifically for this lab.

The data for this lab were rather optimistically generated. Do not map your findings here onto the real world, where the data will be much noisier. Additionally, you should pause before putting such a model into practice in the real world. Is it a useful or ethical thing to do?

Response

y

  • [int64] the age of the individual

Features

There are many features in this data, each representing a gene in the human genome. They are each given a fake name that is a random sequence of ten lowercase letters. Each takes the form:

wjylzzdntx

  • [float64] the simulated expression level of the fake gene named wjylzzdntx, as if it had been quantified using RNA-seq technology

Data in Python

To load the data in Python, use:

import pandas as pd
genes_train = pd.read_csv("https://cs307.org/lab-05/data/genes-train.csv")

To create the X and y variants of the training data, use:

# create X and y for train
X_train = genes_train.drop("y", axis=1)
y_train = genes_train["y"]

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
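As a rough sketch of what a first look at the data could involve, the snippet below computes a few basic summaries. It runs on a small synthetic stand-in frame (the gene names and values are made up for illustration) so that it does not need to download anything; with the real data you would use genes_train instead of demo. The statistics that PrairieLearn actually requests are not listed here, so the ones below are only examples.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for genes_train: three made-up genes plus the response y.
rng = np.random.default_rng(307)
demo = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=["aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc"],
)
demo["y"] = rng.integers(18, 90, size=50)  # integer ages in [18, 90)

# Basic shape information: rows are individuals, columns are genes plus y.
n_samples, n_columns = demo.shape
n_features = n_columns - 1

# Summaries of the response.
y_mean = demo["y"].mean()
y_std = demo["y"].std()  # pandas uses the sample standard deviation (ddof=1)

# Total number of missing values across the whole frame.
n_missing = int(demo.isna().sum().sum())

print(n_samples, n_features, n_missing)
print(demo["y"].describe())
```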

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with X_train and the same variable names and types.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
      • So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method that returns numbers.
  • Your model should be created with scikit-learn version 1.4.0 or newer.
  • Your model should be serialized with joblib version 1.3.2 or newer.
    • Your serialized model must be less than 5MB.
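The Pipeline-plus-GridSearchCV setup described in the rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are a small synthetic stand-in (X_demo and y_demo are made up), and the scaler, the Lasso estimator, and the alpha grid are arbitrary choices for demonstration, not a recommended or known-passing model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for X_train / y_train: 100 rows, 20 made-up genes.
rng = np.random.default_rng(42)
gene_names = [f"gene{i:02d}" for i in range(20)]
X_demo = pd.DataFrame(rng.normal(size=(100, 20)), columns=gene_names)
# In this toy setup only the first two made-up genes drive the response.
y_demo = (
    40
    + 5 * X_demo["gene00"]
    - 3 * X_demo["gene01"]
    + rng.normal(scale=0.1, size=100)
)

# Preprocessing and model live in one Pipeline, so cross-validation
# refits the scaler on each training fold.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", Lasso(max_iter=10_000)),
    ]
)

# Tune the penalty strength; parameter names are "<step>__<parameter>".
param_grid = {"model__alpha": [0.001, 0.01, 0.1, 1.0]}

mod = GridSearchCV(
    pipe,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
mod.fit(X_demo, y_demo)

print(mod.best_params_)
print(-mod.best_score_)  # cross-validated RMSE of the best setting
```

After fitting, the GridSearchCV object itself has fit and predict methods (predict uses the refit best pipeline), so it is the kind of object that can be serialized as the "model".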

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

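To obtain the maximum points via the autograder, your model performance must meet or exceed the following threshold (for RMSE, lower is better, so this means a test RMSE at or below the value shown):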

Test RMSE: 0.325
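For reference, RMSE is the square root of the mean squared difference between the true and predicted responses. The snippet below is a minimal sketch of that calculation with NumPy; the arrays are made-up numbers, not lab data.

```python
import numpy as np

# Made-up true ages and predictions, for illustration only.
y_true = np.array([30.0, 40.0, 50.0])
y_pred = np.array([30.5, 39.5, 50.0])

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```

Because the test data are held by the autograder, a cross-validated RMSE on the training data is the natural stand-in for estimating this quantity while tuning.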

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.
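A minimal sketch of saving a model with joblib and checking its size on disk is shown below. The tiny LinearRegression model and the filename model.joblib are placeholders chosen for illustration; use your own fitted model and the filename given on PrairieLearn.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Tiny hypothetical model, standing in for your tuned model.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 40
mod = LinearRegression().fit(X, y)

# "model.joblib" is a placeholder filename, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
dump(mod, path)

# Check the file size against the 5MB limit. Counting 1 MB as 10**6 bytes
# is an assumption here; it is the stricter of the two common conventions.
size_mb = os.path.getsize(path) / 1e6
print(f"{size_mb:.3f} MB")

# Reload and confirm that the saved model predicts the same values.
reloaded = load(path)
same = np.allclose(reloaded.predict(X), mod.predict(X))
print(same)
```

If a serialized model turns out to be too large, the compress argument of dump (for example compress=3) is one option worth trying.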

Discussion

No discussion this week! No report! Just submit the template notebook (.ipynb and rendered .html) to Canvas and enjoy Spring Break!

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page!

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.