Lab 06: Gene Expression

For Lab 06, you will use gene expression data to develop a model that will predict the cancer type of a tissue sample.

Background

Cancer detection is an unfortunate but important reality. Early detection can significantly improve survival. One consistently researched possibility is the use of genetic information to push detection earlier and earlier. The BRCA mutation is an example of a simple genetic screening that can help better estimate the probability of developing breast cancer.

Downstream of DNA itself, gene expression can give insight into the effects of DNA on phenotypical outcomes.

Next-generation sequencing is a constantly evolving set of technologies that can measure gene expression. As these technologies become cheaper to use, and more readily available, they can potentially be used as part of the process of detecting and identifying cancers.

Scenario and Goal

Who are you?

  • You are a data scientist working for a small biotechnology startup.

What is your task?

  • You are asked to begin exploring the possibility of developing a “universal” cancer detection and classification model, given gene expression data collected via a next-generation sequencing technology such as RNA-Seq. Your goal is not to create a product that is immediately useful, but instead to work towards a proof of concept.

Who are you writing for?

  • To summarize your work, you will write a report for your manager, who in this case is the CEO and founder of the startup. You can assume your manager is very familiar with biology and related technologies, and reasonably familiar with the general concepts of machine learning. They have worked with groups who have placed machine learning models into practice in the past.

Data

To achieve the goal of this lab, we will need gene expression and clinical outcome data. The necessary data is provided in the following files:

  • cancer-train.parquet
  • cancer-test.parquet

Source

The underlying source of this data is The Cancer Genome Atlas Pan-Cancer Analysis Project. The data was accessed via synapse.org.

The specific data for this lab was collected and modified based on a submission to the UC Irvine Machine Learning Repository.

Data Dictionary

Each observation in the train, test, and (hidden) production data contains clinical and gene expression information from a tissue sample of a cancer patient.

Response

cancer

  • [object] the clinically determined cancer type, one of:
    • BRCA: Breast Invasive Carcinoma
    • PRAD: Prostate Adenocarcinoma
    • KIRC: Kidney Renal Clear Cell Carcinoma
    • LUAD: Lung Adenocarcinoma
    • COAD: Colon Adenocarcinoma

Features

gene_

  • [float64] gene expression quantification as measured by an Illumina HiSeq platform

Data in Python

To load the data in Python, use:

import pandas as pd
cancer_train = pd.read_parquet(
    "https://cs307.org/lab-06/data/cancer-train.parquet",
)
cancer_test = pd.read_parquet(
    "https://cs307.org/lab-06/data/cancer-test.parquet",
)
cancer_train
cancer gene_0 gene_1 gene_2 gene_3 gene_4 gene_5 gene_6 gene_7 gene_8 ... gene_1990 gene_1991 gene_1992 gene_1993 gene_1994 gene_1995 gene_1996 gene_1997 gene_1998 gene_1999
0 BRCA 0.0 3.149861 1.913454 5.562355 9.638586 0.0 4.302421 0.511670 0.000000 ... 7.300691 8.383307 1.187198 0.000000 7.350471 0.0 4.634587 7.082415 9.727447 1.187198
1 LUAD 0.0 6.237034 5.043235 6.297397 10.391415 0.0 7.669941 0.913033 0.000000 ... 6.322446 7.815595 13.809095 0.913033 7.651052 0.0 7.476074 4.733739 8.510863 0.000000
2 BRCA 0.0 3.856896 2.394981 6.758277 9.585513 0.0 7.409009 1.242023 0.000000 ... 6.845515 9.194823 5.667696 0.000000 7.748253 0.0 5.567421 5.203158 7.364879 0.000000
3 PRAD 0.0 4.279924 3.606963 5.706613 9.716581 0.0 8.244226 0.402613 0.000000 ... 6.598611 8.199118 5.024218 0.000000 7.607907 0.0 5.705281 6.278007 9.725383 0.000000
4 BRCA 0.0 3.359788 4.199986 6.144766 9.141834 0.0 9.014135 1.061776 0.626486 ... 7.062651 9.670708 4.243707 0.000000 8.156811 0.0 7.363487 5.384844 8.703443 1.894876
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
123 KIRC 0.0 3.069737 3.623200 6.744955 9.591219 0.0 7.254254 0.000000 0.000000 ... 7.333665 8.365334 5.481444 0.000000 7.307674 0.0 5.723447 5.401791 9.067881 0.000000
124 BRCA 0.0 3.534497 3.064866 6.638882 10.010206 0.0 7.899387 0.000000 0.000000 ... 7.080615 9.968912 10.501041 0.000000 8.866052 0.0 5.276806 4.941073 8.548240 0.000000
125 BRCA 0.0 4.087463 3.786596 6.385845 9.544964 0.0 8.062856 0.000000 0.000000 ... 5.842979 9.931033 10.234817 0.000000 8.384568 0.0 7.033423 5.749534 8.132371 0.000000
126 LUAD 0.0 3.272889 4.529234 7.134909 9.504362 0.0 5.668893 0.000000 0.000000 ... 7.300966 8.040350 11.972872 0.000000 7.592607 0.0 5.026256 6.137704 8.780947 1.167936
127 KIRC 0.0 3.217851 2.142315 6.024548 9.537583 0.0 7.910391 1.974419 0.000000 ... 7.372552 8.358203 4.966864 0.406537 7.673309 0.0 5.561448 5.163124 8.101991 1.849679

128 rows × 2001 columns

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = cancer_train.drop(columns=["cancer"])
y_train = cancer_train["cancer"]

# create X and y for test
X_test = cancer_test.drop(columns=["cancer"])
y_test = cancer_test["cancer"]

You can assume that within the autograder, similar processing is performed on the production data.

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
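
As a starting point for looking at the data, the sketch below computes a few basic summaries: sample and feature counts, the number of samples per cancer type, and any genes with zero expression in every sample. It is only an illustration. The demo frame is synthetic stand-in data, the summarize helper is a hypothetical name, and the statistics actually requested on PrairieLearn may differ.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for cancer_train (hypothetical values, NOT the lab data):
# one "cancer" label column plus a few gene_* expression columns.
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    rng.gamma(shape=2.0, scale=2.0, size=(12, 4)),
    columns=[f"gene_{i}" for i in range(4)],
)
demo["gene_0"] = 0.0  # mimic a gene that is never expressed
demo.insert(0, "cancer", ["BRCA", "LUAD", "KIRC"] * 4)


def summarize(df: pd.DataFrame) -> dict:
    """Return a few basic summaries of a labeled expression frame."""
    X = df.drop(columns=["cancer"])
    return {
        "n_samples": len(df),
        "n_features": X.shape[1],
        # number of samples of each cancer type
        "class_counts": df["cancer"].value_counts().to_dict(),
        # genes whose expression is zero in every sample
        "all_zero_genes": X.columns[(X == 0).all()].tolist(),
    }


summary = summarize(demo)
print(summary)
```

With the real data, the same helper could be called as summarize(cancer_train), assuming the frame was loaded as shown in the Data in Python section.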

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with X_train and the same variable names and types.
    • In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a shape compatible with X_train and the same variable names and types.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
      • So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method.
  • Your model must have a predict_proba method.
  • Your model should be created with scikit-learn version 1.5.2 or newer.
  • Your model should be serialized with joblib version 1.4.2 or newer.
    • Your serialized model must be less than 5MB.
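
One way the rules above could be satisfied is sketched below: a Pipeline wrapped in GridSearchCV, which after fitting exposes fit, predict, and predict_proba. This is a minimal sketch, not a recommended model. The data is a synthetic stand-in, and the choice of StandardScaler, LogisticRegression, and the small grid over C are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (hypothetical, NOT the lab data).
rng = np.random.default_rng(0)
labels = np.repeat(["BRCA", "KIRC", "LUAD"], 20)
centers = {"BRCA": 0.0, "KIRC": 3.0, "LUAD": 6.0}
X_demo = pd.DataFrame(
    np.vstack([rng.normal(centers[c], 1.0, size=10) for c in labels]),
    columns=[f"gene_{i}" for i in range(10)],
)
y_demo = pd.Series(labels, name="cancer")

# Preprocessing and classifier live in one Pipeline, so the scaler is
# re-fit inside each cross-validation fold rather than on all the data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid only: tune the regularization strength of the classifier.
param_grid = {"clf__C": [0.01, 0.1, 1.0]}

# GridSearchCV refits the best pipeline on all supplied data by default,
# so the fitted search object can itself be used as the model.
mod = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
mod.fit(X_demo, y_demo)

print(mod.best_params_)
print(mod.predict(X_demo.head(3)))
print(mod.predict_proba(X_demo.head(3)).shape)
```

With the lab data, the fit call would instead take X_train and y_train, left unmodified as the rules require.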

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To obtain the maximum points via the autograder, your model performance must meet or exceed:

Test Accuracy: 1.0
Production Accuracy: 0.99
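
Accuracy here is the fraction of samples whose predicted cancer type matches the clinically determined one, so a test accuracy of 1.0 means no test sample is misclassified. The small sketch below shows the calculation with sklearn's accuracy_score, using made-up labels and predictions for illustration only.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only.
y_true = ["BRCA", "LUAD", "KIRC", "PRAD", "COAD"]
y_pred = ["BRCA", "LUAD", "KIRC", "PRAD", "BRCA"]

# Fraction of exact matches: 4 of the 5 predictions are correct.
acc = accuracy_score(y_true, y_pred)
print(f"accuracy: {acc:.2f}")
```

For a fitted model, the analogous call would be accuracy_score(y_test, mod.predict(X_test)). The production accuracy is computed by the autograder on hidden data, so it cannot be checked locally.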

Model Persistence

To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects for this lab.

from joblib import dump
dump(mod, "filename.joblib")
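
Before submitting, it may be worth confirming that the serialized file is under the size limit and that it loads back into a working model. The sketch below does both with a tiny stand-in model. The filename is a placeholder, and treating 1 MB as 1,000,000 bytes is an assumption, since the handout does not say how the autograder measures size.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (hypothetical); in the lab this would be the fitted mod.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array(["BRCA", "BRCA", "LUAD", "LUAD"])
demo_mod = LogisticRegression().fit(X_demo, y_demo)

# Placeholder filename; the expected name is listed on PrairieLearn.
path = os.path.join(tempfile.mkdtemp(), "demo-model.joblib")
dump(demo_mod, path)

# File size in megabytes (assuming 1 MB = 1,000,000 bytes).
size_mb = os.path.getsize(path) / 1e6
print(f"serialized size: {size_mb:.3f} MB")

# Reload and confirm the round trip preserves predictions.
reloaded = load(path)
same = bool((reloaded.predict(X_demo) == demo_mod.predict(X_demo)).all())
print(f"predictions preserved: {same}")
```

If a real model comes out too large, joblib's dump also accepts a compress argument, which trades some save and load time for a smaller file.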

Discussion

As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real-world scenario described at the start of the lab. Justify your conclusion. If your report mentions multiple trained models, first make clear which model you selected and are considering for use in practice.

Additional discussion prompts:

  • Assuming you develop a promising model (which is likely), what next steps would you take to further investigate the possibility of a universal cancer detector and classifier?

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.