Lab 07: Genetics

For Lab 07, you will use genetic data to develop a model that will predict the cancer type of a tissue sample.

Background

Cancer detection is an unfortunate but important reality. Early detection can significantly improve survival. One consistently researched possibility is the use of genetic information to push detection earlier and earlier. The BRCA mutation is an example of a simple genetic screening that can help better estimate the probability of developing breast cancer.

Downstream of DNA itself, gene expression can give insight into the effects of DNA on phenotypical outcomes.

Next-generation sequencing is a constantly evolving set of technologies that can measure gene expression. As these technologies become cheaper to use, and more readily available, they can potentially be used as part of the process of detecting and identifying cancers.

Scenario and Goal

Who are you?

  • You are a data scientist working for a small biotechnology startup.

What is your task?

  • You are asked to begin exploring the possibility of developing a “universal” cancer detection and classification model, given gene expression data collected via a next-generation sequencing technology such as RNA-Seq. Your goal is not to create a product that is immediately useful, but instead to work towards a proof of concept.

Who are you writing for?

  • To summarize your work, you will write a report for your manager, who in this case is the CEO and founder of the startup. You can assume your manager is very familiar with biology and related technologies, and reasonably familiar with the general concepts of machine learning. They have worked with groups who have placed machine learning models into practice in the past.

Data

To achieve the goal of this lab, we will need gene expression and clinical outcome data. The necessary data is provided in the following files:

Source

The underlying source of this data is The Cancer Genome Atlas Pan-Cancer Analysis Project. The data was accessed via synapse.org.

The specific data for this lab was collected and modified based on a submission to the UC Irvine Machine Learning Repository.

Data Dictionary

Each observation in the train, test, and (hidden) production data contains clinical and gene expression information from a tissue sample of a cancer patient.

Response

cancer

  • [object] the clinically determined cancer type, one of:
    • BRCA: Breast Invasive Carcinoma
    • PRAD: Prostate Adenocarcinoma
    • KIRC: Kidney Renal Clear Cell Carcinoma
    • LUAD: Lung Adenocarcinoma
    • COAD: Colon Adenocarcinoma

Features

gene_####

  • [float64] gene expression (for gene number #### in the dataset) quantification as measured by an Illumina HiSeq platform

Data in Python

To load the data in Python, use:

import pandas as pd
genetics_train = pd.read_parquet(
    "https://cs307.org/lab/data/genetics-train.parquet",
)
genetics_test = pd.read_parquet(
    "https://cs307.org/lab/data/genetics-test.parquet",
)

Prepare Data for Machine Learning

Create the X and y variants of the data for use with sklearn:

# create X and y for train
X_train = genetics_train.drop(columns=["cancer"])
y_train = genetics_train["cancer"]

# create X and y for test
X_test = genetics_test.drop(columns=["cancer"])
y_test = genetics_test["cancer"]

You can assume that within the autograder, similar processing is performed on the production data.
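Since the autograder passes data with the same variable names and types as X_train, it can help to confirm that any data you hand to your model matches. The following is a minimal sketch; the helper name check_compatible and the tiny data frames are hypothetical stand-ins, not part of the lab materials.

```python
import pandas as pd


def check_compatible(X_ref: pd.DataFrame, X_new: pd.DataFrame) -> None:
    """Raise ValueError if X_new does not match X_ref's column names and dtypes."""
    if list(X_new.columns) != list(X_ref.columns):
        raise ValueError("column names or column order differ")
    if not (X_new.dtypes == X_ref.dtypes).all():
        raise ValueError("column dtypes differ")


# Hypothetical stand-ins for X_train and X_test (not the lab data).
X_a = pd.DataFrame({"gene_0000": [0.1, 0.2], "gene_0001": [1.5, 2.5]})
X_b = pd.DataFrame({"gene_0000": [0.3], "gene_0001": [3.5]})

check_compatible(X_a, X_b)  # passes silently when the frames are compatible
```

With the lab data, the equivalent call would be check_compatible(X_train, X_test).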

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn.
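The exact statistics requested on PrairieLearn are not listed on this page, so the quantities below are assumptions about what is typically useful: sample and feature counts, class counts and proportions, missing values, and constant features. The function name summarize_genetics and the toy data frame are hypothetical; the toy data only exists so the sketch runs without a download.

```python
import numpy as np
import pandas as pd


def summarize_genetics(df: pd.DataFrame, target: str = "cancer") -> dict:
    """Return basic summary statistics for a gene expression data frame."""
    features = df.drop(columns=[target])
    counts = df[target].value_counts()
    return {
        "n_samples": len(df),
        "n_features": features.shape[1],
        "class_counts": counts.to_dict(),
        "class_proportions": (counts / len(df)).round(3).to_dict(),
        "n_missing": int(features.isna().sum().sum()),
        # features with a single unique value carry no information
        "n_constant_features": int((features.nunique() <= 1).sum()),
    }


# Tiny synthetic stand-in for genetics_train (assumption, not the lab data).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(6, 3)),
    columns=["gene_0000", "gene_0001", "gene_0002"],
)
toy["gene_0002"] = 0.0  # make one feature constant
toy["cancer"] = ["BRCA", "BRCA", "BRCA", "KIRC", "KIRC", "LUAD"]

summary = summarize_genetics(toy)
print(summary)
```

With the lab data, the equivalent call would be summarize_genetics(genetics_train).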

Models

For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like, so long as your model meets these requirements:

  • Your model must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with X_train and the same variable names and types as X_train.
    • In the autograder, we will call mod.predict(X_prod) on your model, where your model is loaded as mod and X_prod has a shape compatible with X_train and the same variable names and types as X_train.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation to tune your model.
      • More specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a tuned model that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method.
  • Your model must have a predict_proba method.
  • Your model should be created with scikit-learn version 1.6.1 or newer.
  • Your model should be serialized with joblib version 1.4.2 or newer.
  • Your serialized model must be less than 5MB.
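The requirements above can be sketched as follows: a Pipeline tuned with GridSearchCV, serialized with joblib, and checked for file size. This is one possible setup, not a recommended solution. The scaler, the logistic regression, the parameter grid, the file name, and the synthetic data are all assumptions chosen for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train and y_train (assumption, not the lab data).
rng = np.random.default_rng(42)
labels = np.repeat(["BRCA", "KIRC", "LUAD"], 20)
shift = pd.Series(labels).map({"BRCA": 0.0, "KIRC": 3.0, "LUAD": 6.0}).to_numpy()
X_toy = pd.DataFrame(
    rng.normal(size=(60, 5)) + shift[:, None],
    columns=[f"gene_{i:04d}" for i in range(5)],
)
y_toy = pd.Series(labels, name="cancer")

# Preprocessing and classifier live in one Pipeline so that
# cross-validation refits the scaler inside each fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}  # illustrative grid only

# The fitted GridSearchCV object exposes fit, predict, and predict_proba.
mod = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
mod.fit(X_toy, y_toy)

# Serialize, check the size on disk, and reload to confirm it still predicts.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genetics.joblib")  # hypothetical file name
    joblib.dump(mod, path)
    size_mb = os.path.getsize(path) / 1e6
    reloaded = joblib.load(path)

print(mod.best_params_, f"{size_mb:.3f} MB")
```

With the lab data, X_toy and y_toy would be replaced by X_train and y_train. If the serialized file is too large, note that the size depends on what the fitted model stores, so a different estimator or fewer retained features may be needed.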

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To obtain the maximum points via the autograder, your model must meet or exceed the following metrics:

Test Accuracy: 1.0
Production Accuracy: 0.99

For this lab, be aware that the class imbalance means that a single test set gives a highly variable evaluation of model performance.
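One way to see how much an accuracy estimate moves from split to split is repeated stratified cross-validation, which keeps class proportions roughly constant in every fold. The sketch below uses synthetic imbalanced data from make_classification as a stand-in; the model, the number of splits, and the number of repeats are assumptions, not lab requirements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced three-class data (assumption, not the lab data).
X_toy, y_toy = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=10,
    n_classes=3,
    weights=[0.6, 0.3, 0.1],
    random_state=0,
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds repeated 5 times gives 25 accuracy estimates instead of one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")

print(
    f"accuracy: mean={scores.mean():.3f}, "
    f"sd={scores.std():.3f}, min={scores.min():.3f}, max={scores.max():.3f}"
)
```

The spread between the minimum and maximum scores gives a rough sense of how far a single train/test split could land from the average.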

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy document.

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.