import pandas as pd
Lab 06: Gene Expression
For Lab 06, you will use gene expression data to develop a model that will predict the cancer type of a tissue sample.
Background
Cancer detection is an unfortunate but important reality. Early detection can significantly improve survival. One actively researched possibility is the use of genetic information to detect cancer earlier. Screening for BRCA mutations is an example of a simple genetic test that can help better estimate the probability of developing breast cancer.
Downstream of DNA itself, gene expression can give insight into the effects of DNA on phenotypical outcomes.
Next-generation sequencing is a constantly evolving set of technologies that can measure gene expression. As these technologies become cheaper to use, and more readily available, they can potentially be used as part of the process of detecting and identifying cancers.
Scenario and Goal
Who are you?
- You are a data scientist working for a small biotechnology startup.
What is your task?
- You are asked to begin exploring the possibility of developing a “universal” cancer detection and classification model, given gene expression data collected via a next-generation sequencing technology such as RNA-Seq. Your goal is not to create a product that is immediately useful, but instead to work towards a proof of concept.
Who are you writing for?
- To summarize your work, you will write a report for your manager, who in this case is the CEO and founder of the startup. You can assume your manager is very familiar with biology and related technologies, and reasonably familiar with the general concepts of machine learning. They have worked with groups who have placed machine learning models into practice in the past.
Data
To achieve the goal of this lab, we will need gene expression and clinical outcome data. The necessary data is provided in the following files:
Source
The underlying source of this data is The Cancer Genome Atlas Pan-Cancer Analysis Project. The data was accessed via synapse.org.
The specific data for this lab was collected and modified based on a submission to the UC Irvine Machine Learning Repository.
Data Dictionary
Each observation in the train, test, and (hidden) production data contains clinical and gene expression information from a tissue sample of a cancer patient.
Response
cancer
[object]
the clinically determined cancer type, one of:
- BRCA: Breast Invasive Carcinoma
- PRAD: Prostate Adenocarcinoma
- KIRC: Kidney Renal Clear Cell Carcinoma
- LUAD: Lung Adenocarcinoma
- COAD: Colon Adenocarcinoma
Features
gene_
[float64]
gene expression quantification as measured by an Illumina HiSeq platform
Data in Python
To load the data in Python, use:
cancer_train = pd.read_parquet(
    "https://cs307.org/lab-06/data/cancer-train.parquet",
)
cancer_test = pd.read_parquet(
    "https://cs307.org/lab-06/data/cancer-test.parquet",
)
cancer_train
| | cancer | gene_0 | gene_1 | gene_2 | gene_3 | gene_4 | gene_5 | gene_6 | gene_7 | gene_8 | ... | gene_1990 | gene_1991 | gene_1992 | gene_1993 | gene_1994 | gene_1995 | gene_1996 | gene_1997 | gene_1998 | gene_1999 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BRCA | 0.0 | 3.149861 | 1.913454 | 5.562355 | 9.638586 | 0.0 | 4.302421 | 0.511670 | 0.000000 | ... | 7.300691 | 8.383307 | 1.187198 | 0.000000 | 7.350471 | 0.0 | 4.634587 | 7.082415 | 9.727447 | 1.187198 |
1 | LUAD | 0.0 | 6.237034 | 5.043235 | 6.297397 | 10.391415 | 0.0 | 7.669941 | 0.913033 | 0.000000 | ... | 6.322446 | 7.815595 | 13.809095 | 0.913033 | 7.651052 | 0.0 | 7.476074 | 4.733739 | 8.510863 | 0.000000 |
2 | BRCA | 0.0 | 3.856896 | 2.394981 | 6.758277 | 9.585513 | 0.0 | 7.409009 | 1.242023 | 0.000000 | ... | 6.845515 | 9.194823 | 5.667696 | 0.000000 | 7.748253 | 0.0 | 5.567421 | 5.203158 | 7.364879 | 0.000000 |
3 | PRAD | 0.0 | 4.279924 | 3.606963 | 5.706613 | 9.716581 | 0.0 | 8.244226 | 0.402613 | 0.000000 | ... | 6.598611 | 8.199118 | 5.024218 | 0.000000 | 7.607907 | 0.0 | 5.705281 | 6.278007 | 9.725383 | 0.000000 |
4 | BRCA | 0.0 | 3.359788 | 4.199986 | 6.144766 | 9.141834 | 0.0 | 9.014135 | 1.061776 | 0.626486 | ... | 7.062651 | 9.670708 | 4.243707 | 0.000000 | 8.156811 | 0.0 | 7.363487 | 5.384844 | 8.703443 | 1.894876 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
123 | KIRC | 0.0 | 3.069737 | 3.623200 | 6.744955 | 9.591219 | 0.0 | 7.254254 | 0.000000 | 0.000000 | ... | 7.333665 | 8.365334 | 5.481444 | 0.000000 | 7.307674 | 0.0 | 5.723447 | 5.401791 | 9.067881 | 0.000000 |
124 | BRCA | 0.0 | 3.534497 | 3.064866 | 6.638882 | 10.010206 | 0.0 | 7.899387 | 0.000000 | 0.000000 | ... | 7.080615 | 9.968912 | 10.501041 | 0.000000 | 8.866052 | 0.0 | 5.276806 | 4.941073 | 8.548240 | 0.000000 |
125 | BRCA | 0.0 | 4.087463 | 3.786596 | 6.385845 | 9.544964 | 0.0 | 8.062856 | 0.000000 | 0.000000 | ... | 5.842979 | 9.931033 | 10.234817 | 0.000000 | 8.384568 | 0.0 | 7.033423 | 5.749534 | 8.132371 | 0.000000 |
126 | LUAD | 0.0 | 3.272889 | 4.529234 | 7.134909 | 9.504362 | 0.0 | 5.668893 | 0.000000 | 0.000000 | ... | 7.300966 | 8.040350 | 11.972872 | 0.000000 | 7.592607 | 0.0 | 5.026256 | 6.137704 | 8.780947 | 1.167936 |
127 | KIRC | 0.0 | 3.217851 | 2.142315 | 6.024548 | 9.537583 | 0.0 | 7.910391 | 1.974419 | 0.000000 | ... | 7.372552 | 8.358203 | 4.966864 | 0.406537 | 7.673309 | 0.0 | 5.561448 | 5.163124 | 8.101991 | 1.849679 |
128 rows × 2001 columns
Prepare Data for Machine Learning
Create the X and y variants of the data for use with sklearn:
# create X and y for train
X_train = cancer_train.drop(columns=["cancer"])
y_train = cancer_train["cancer"]

# create X and y for test
X_test = cancer_test.drop(columns=["cancer"])
y_test = cancer_test["cancer"]
You can assume that within the autograder, similar processing is performed on the production data.
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
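As a starting point for looking at the data, the sketch below computes a few common summaries: samples per cancer type, mean expression per gene within each type, and a list of genes that never vary. It is only an illustration and not the specific statistics requested on PrairieLearn. The `demo` frame is a synthetic stand-in that mimics the layout of `cancer_train` (a `cancer` column followed by `gene_*` columns); all values in it are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same layout as cancer_train (assumption):
# a "cancer" label column followed by float64 gene_* columns.
rng = np.random.default_rng(307)
labels = ["BRCA", "PRAD", "KIRC", "LUAD", "COAD"]
demo = pd.DataFrame(
    rng.gamma(shape=2.0, scale=2.0, size=(20, 4)),
    columns=[f"gene_{i}" for i in range(4)],
)
demo.insert(0, "cancer", [labels[i % 5] for i in range(20)])

# Number and proportion of samples per cancer type.
class_counts = demo["cancer"].value_counts()
class_props = demo["cancer"].value_counts(normalize=True)

# Mean expression of each gene within each cancer type.
gene_means = demo.groupby("cancer").mean(numeric_only=True)

# Genes that take a single value everywhere cannot help a classifier.
gene_cols = demo.columns.drop("cancer")
constant_genes = [c for c in gene_cols if demo[c].nunique() == 1]

print(class_counts)
print(gene_means.round(2))
print("constant genes:", constant_genes)
```

To apply this to the lab data, replace `demo` with `cancer_train`. The constant-gene check may be worth running there, since `gene_0` and `gene_5` show only 0.0 in the rows previewed earlier.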
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with and the same variable names and types as X_train.
  - In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a compatible shape with and the same variable names and types as X_train.
  - We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
    - So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method.
- Your model must have a predict_proba method.
- Your model should be created with scikit-learn version 1.5.2 or newer.
- Your model should be serialized with joblib version 1.4.2 or newer.
  - Your serialized model must be less than 5MB.
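The rules above can be checked with a sketch like the following, which fits a Pipeline with GridSearchCV, confirms the required methods exist, and verifies the serialized size. The scaler, the logistic regression classifier, and the parameter grid are illustrative assumptions, not a recommended model. `X_demo` and `y_demo` are synthetic stand-ins for `X_train` and `y_train`, and `demo-model.joblib` is a hypothetical filename.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: the real data has
# the same layout, but with 2000 gene columns).
rng = np.random.default_rng(307)
labels = np.array(["BRCA", "PRAD", "KIRC", "LUAD", "COAD"])
y_demo = pd.Series(np.repeat(labels, 12), name="cancer")
centers = rng.normal(0.0, 3.0, size=(5, 10))
X_demo = pd.DataFrame(
    np.repeat(centers, 12, axis=0) + rng.normal(0.0, 1.0, size=(60, 10)),
    columns=[f"gene_{i}" for i in range(10)],
)

# Preprocessing and classifier live in one Pipeline, so the scaler is
# refit inside every cross-validation fold.
pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
param_grid = {"clf__C": [0.1, 1.0, 10.0]}  # illustrative grid only

# GridSearchCV refits the best pipeline on all of the data by default,
# so the fitted search object exposes fit, predict, and predict_proba.
mod = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
mod.fit(X_demo, y_demo)

has_required_methods = all(
    hasattr(mod, name) for name in ("fit", "predict", "predict_proba")
)
proba = mod.predict_proba(X_demo.head(3))

# Serialize to a temporary file to check the size limit and a round trip.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.joblib")  # hypothetical name
    dump(mod, path)
    size_mb = os.path.getsize(path) / 1e6
    reloaded = load(path)

print("best params:", mod.best_params_)
print("required methods present:", has_required_methods)
print(f"serialized size: {size_mb:.3f} MB")
```

Dividing by 1e6 rather than 2**20 gives the stricter reading of the 5MB limit, which is the safer one if the autograder's exact definition is unknown.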
To obtain the maximum points via the autograder, your model performance must meet or exceed:
Test Accuracy: 1.0
Production Accuracy: 0.99
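Test accuracy can be checked locally before submitting. The sketch below uses `accuracy_score` from sklearn on made-up labels and predictions; in the lab these would be `y_test` and `mod.predict(X_test)`. The constant `TEST_ACCURACY_TARGET` is a name introduced here for illustration.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only.
y_true = ["BRCA", "LUAD", "BRCA", "PRAD", "KIRC", "COAD"]
y_pred = ["BRCA", "LUAD", "BRCA", "PRAD", "KIRC", "LUAD"]

TEST_ACCURACY_TARGET = 1.0  # test threshold stated in this lab

# Fraction of predictions that exactly match the true labels.
acc = accuracy_score(y_true, y_pred)

print(f"accuracy: {acc:.3f}")
print("meets test target:", acc >= TEST_ACCURACY_TARGET)
```

Production accuracy cannot be computed this way, because the production data is hidden and only evaluated in the autograder.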
Model Persistence
To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.
from joblib import dump
dump(mod, "filename.joblib")
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real-world scenario described at the start of the lab! Justify your conclusion. If you trained multiple models that are mentioned in your report, first make clear which model you selected and are considering for use in practice.
Additional discussion prompts:
- Assuming you develop a promising model (which is likely), what next steps would you take to further investigate the possibility of a universal cancer detector and classifier?
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.