import pandas as pd
Lab 05: Gene Expression
Scenario: For this lab, we are downplaying the scenario, as you will not need to write a report, and the data is optimistically simulated. Your main focus in this lab is tuning a high-performing model. The features can be viewed as gene expression measurements and the response as age, so you can imagine using gene expression to predict age. For a real-world example of this scenario:
Also, the following review paper shows an overview of the application of machine learning techniques in genomics:
Yes, you read correctly, no “real” report to write for this lab. But for grading purposes, you will still submit a blank template document. This way, we hope you can complete the autograded portion of the lab and simply be done with the lab before Spring Break!
Goal
The goal of this lab is to create a regression model that predicts the age of a human individual given gene expression data.
Data
This lab will use data simulated specifically for this lab.
Response
y [int64]: the age of the individual
Features
There are many features in this data, each representing a gene in the human genome. They are each given a fake name that is a random sequence of ten lowercase letters. Each takes the form:
wjylzzdntx [float64]: the gene expression for the fake gene named wjylzzdntx, as (fake) quantified using RNA-seq technology
Data in Python
To load the data in Python, use:
genes_train = pd.read_csv("https://cs307.org/lab-05/data/genes-train.csv")
To create the X and y variants of the training data, use:
# create X and y for train
X_train = genes_train.drop("y", axis=1)
y_train = genes_train["y"]
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
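As a rough sketch of the kind of checks this involves, the block below computes a few common summaries. The exact statistics requested are listed on PrairieLearn, so the quantities here are only guesses at what might be useful. The `demo` frame and its `gene_a` to `gene_d` columns are a small synthetic stand-in (an assumption made so the example runs on its own); with the real data you would use `genes_train` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for genes_train (hypothetical names and values).
rng = np.random.default_rng(307)
demo = pd.DataFrame(
    rng.normal(size=(50, 4)),
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)
demo["y"] = rng.integers(18, 90, size=50)

n_samples, n_columns = demo.shape         # rows and columns, including y
n_features = n_columns - 1                # every column except the response
n_missing = int(demo.isna().sum().sum())  # total count of missing values
y_summary = demo["y"].describe()          # count, mean, std, min, quartiles, max

print(n_samples, n_features, n_missing)
print(y_summary)
```

Comparing the number of features to the number of samples is worth doing early, since it can influence which models and how much regularization are reasonable to try.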
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with and the same variable names and types as X_train.
  - We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
    - So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method that returns numbers.
- Your model should be created with scikit-learn version 1.4.0 or newer.
- Your model should be serialized with joblib version 1.3.2 or newer.
  - Your serialized model must be less than 5MB.
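The rules above can be sketched as follows. This is a minimal illustration, not the intended solution: the choice of `StandardScaler` and `Lasso`, the `alpha` grid, and the synthetic `X_demo` and `y_demo` data are all assumptions made so the example runs on its own. With the real data you would fit on `X_train` and `y_train`, and you are free to use any other estimator and grid.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train and y_train (hypothetical names and values).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(80, 10)),
    columns=[f"gene_{i}" for i in range(10)],
)
y_demo = 40 + 5 * X_demo["gene_0"] - 3 * X_demo["gene_1"] + rng.normal(scale=0.5, size=80)

# Preprocessing and model live in one Pipeline, so the fitted object
# accepts data in the same form as the original training features.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])

# Step-name prefix ("model__") routes each parameter to the right step.
param_grid = {"model__alpha": [0.01, 0.1, 1.0]}

# 5-fold cross-validation, scored by (negated) RMSE.
mod = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
mod.fit(X_demo, y_demo)

print(mod.best_params_)
preds = mod.predict(X_demo)  # the fitted search object exposes predict directly
```

Because `GridSearchCV` refits the best pipeline on all of the data it was given by default, the fitted `mod` object itself has both `fit` and `predict` methods and can be serialized as a single model.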
To obtain the maximum points via the autograder, your model performance must meet or exceed:
Test RMSE: 0.325
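For reference, here is a small worked example of the metric, assuming the threshold refers to root mean squared error computed in the usual way, where a lower value is better. The ages and predictions below are made up for illustration. The autograder evaluates on test data you do not have, so any RMSE you compute yourself, for example through cross-validation, is only an estimate of the score you will receive.

```python
import numpy as np

# Made-up true ages and predictions, for illustration only.
y_true = np.array([30.0, 45.0, 60.0])
y_pred = np.array([30.3, 44.6, 60.0])

# RMSE: square the errors, average them, then take the square root.
errors = y_true - y_pred
rmse = float(np.sqrt(np.mean(errors ** 2)))

print(round(rmse, 4))  # about 0.2887
print(rmse <= 0.325)   # True: this toy example would be under the threshold
```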
Model Persistence
To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.
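A minimal sketch of saving a model and checking it afterwards is shown below. The filename `placeholder-name.joblib` is a placeholder and not the name the autograder expects, and the small `LinearRegression` model on synthetic data is a stand-in for your fitted model. The size check assumes 5MB means 5,000,000 bytes, so leave some margin if your file is close to the limit.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Stand-in model fit on synthetic data (hypothetical; use your own fitted model).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])
mod = LinearRegression().fit(X_demo, y_demo)

# Placeholder filename; check PrairieLearn for the real one.
path = os.path.join(tempfile.mkdtemp(), "placeholder-name.joblib")
dump(mod, path)

# Confirm the file size and that the reloaded model predicts identically.
size_mb = os.path.getsize(path) / 1e6
reloaded = load(path)
same_predictions = bool(np.allclose(reloaded.predict(X_demo), mod.predict(X_demo)))

print(f"{size_mb:.3f} MB, round trip ok: {same_predictions}")
```

Reloading the file and calling `predict` mirrors what the autograder is described as doing, so it is a cheap way to catch serialization problems before submitting.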
Discussion
No discussion this week! No report! Just submit the template notebook (.ipynb and rendered .html) to Canvas and enjoy Spring Break!
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.