Course Content

CS 307 content will largely be structured around three quizzes, with each quiz corresponding to three modules of content. Additionally, we will have an introductory module and a concluding module.

Introduction

The first group of modules (containing only one module) will serve as an introduction to CS 307.

Module 00

In this module, you will become familiar with the course and we will get your machine set up to complete homework and labs. We will then overview the fundamental machine learning tasks, introduce two very basic methods, and define basic metrics for assessing supervised learning methods.

Topics

  • Machine Learning Tasks
    • Supervised Learning
      • Classification
      • Regression
    • Unsupervised Learning
      • Density Estimation
      • Clustering
      • Novelty and Outlier Detection
      • Dimension Reduction
    • Reinforcement Learning
  • Baseline Methods
    • DummyClassifier
    • DummyRegressor
  • Supervised Learning Metrics
    • Regression
      • Root Mean Square Error (RMSE)
      • Mean Absolute Error (MAE)
      • Mean Absolute Percentage Error (MAPE)
      • Coefficient of Determination (\(R^2\))
      • Max Error
    • Classification
      • Accuracy
      • Misclassification
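
To preview what working with these baselines looks like, here is a minimal sketch (toy data invented for illustration, not course materials) that fits sklearn's DummyRegressor and DummyClassifier and computes a few of the metrics above:

```python
# A minimal sketch with invented toy data: fit sklearn's dummy baselines
# and compute RMSE, MAE, and accuracy.
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# toy regression data; DummyRegressor ignores X and predicts the mean of y
X = np.arange(10).reshape(-1, 1)
y = np.array([3.1, 2.9, 3.5, 4.0, 3.8, 4.2, 3.9, 4.1, 3.7, 4.3])

dummy_reg = DummyRegressor(strategy="mean").fit(X, y)
pred = dummy_reg.predict(X)
print("RMSE:", np.sqrt(mean_squared_error(y, pred)))  # Root Mean Square Error
print("MAE: ", mean_absolute_error(y, pred))          # Mean Absolute Error

# toy classification data; DummyClassifier predicts the most frequent class
y_class = np.array([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])
dummy_clf = DummyClassifier(strategy="most_frequent").fit(X, y_class)
print("Accuracy:", accuracy_score(y_class, dummy_clf.predict(X)))
```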

Learning Objectives

After completing this module, you are expected to be able to:

  • Understand the syllabus of the course.
  • Understand the objectives of the course.
  • Communicate with the course staff.
  • Use Python, Jupyter, and VSCode to produce code for labs, quizzes, and MPs.
  • Use PrairieLearn to complete homework, lab models, and MPs.
  • Use Canvas to complete lab reports.
  • Differentiate between supervised, unsupervised, and reinforcement learning.
  • Identify regression and classification tasks.
  • Use sklearn baseline models DummyClassifier and DummyRegressor.
  • Calculate metrics to evaluate predictions from regression and classification methods.

Slides, Scribbles, and Readings

Quiz 01

Quiz 01 will focus on two foundational nonparametric supervised learning methods: k-nearest neighbors (KNN) and decision trees. Both methods can be used for classification and regression tasks. While discussing these methods, we will introduce the notion of generalization, and tools for model selection such as cross-validation.

Module 01

In Module 01 we will begin discussing supervised learning, both the regression and classification tasks. We will look at one of the foundational methods of machine learning: k-nearest neighbors. We will also introduce data splitting and overfitting.

Topics

  • K-Nearest Neighbors (KNN) Regression
    • KNeighborsRegressor
  • K-Nearest Neighbors (KNN) Classification
    • KNeighborsClassifier
  • Overfitting
  • Train, Test, and Validation Datasets
    • train_test_split
  • Object-Oriented Programming (OOP) in Python
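
As a rough illustration of these topics (simulated data, not course materials), the sketch below splits data into train, validation, and test sets and uses the validation set to compare KNN models of different flexibility:

```python
# A minimal sketch with simulated toy data: split into train/validation/test,
# then vary k to see how flexibility affects validation error.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# first split off 60% for training, then split the rest into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)

# small k -> very flexible (risk of overfitting); large k -> very rigid
for k in [1, 5, 25, 100]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, knn.predict(X_val)))
    print(f"k={k:3d}  validation RMSE: {rmse:.3f}")
```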

Learning Objectives

  • Differentiate between regression and classification tasks.
  • Use k-nearest neighbors to make predictions for pre-processed data.
  • Understand how conditional probabilities relate to classifications.
  • Estimate and calculate conditional probabilities.
  • Use k-nearest neighbors to estimate conditional probabilities.
  • Split data into train, validation, and test sets.
  • Modify a tuning parameter to control the flexibility of a model.
  • Avoid overfitting by selecting a model through the use of a validation set.

Slides, Scribbles, and Readings

Module 02

In Module 02 we will look at the bigger picture and focus on selecting tuning parameters and preprocessing data, especially heterogeneous data stored in Pandas DataFrame objects. We’ll also discuss overfitting, generalization, and a concept related to both: the bias-variance tradeoff.

Topics

  • Bias-Variance Tradeoff
  • Generalization
  • Cross-Validation
  • Preprocessing
  • sklearn API and Pipelines
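
To show how these pieces fit together, here is a minimal sketch on an invented toy DataFrame (not course data) that chains preprocessing into a Pipeline and tunes a KNN model with GridSearchCV:

```python
# A minimal sketch with an invented toy DataFrame: reproducible preprocessing
# inside a Pipeline, tuned with cross-validation.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "sqft": [850, 1200, np.nan, 1600, 900, 2000, 1100, 1750],
    "rooms": [2, 3, 2, 4, 2, 5, 3, 4],
    "city": ["A", "B", "A", "B", "A", "B", "B", "A"],
})
y = np.array([100, 180, 120, 240, 110, 300, 170, 260])

# impute and scale numeric columns; one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), ["sqft", "rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([("preprocess", preprocess), ("knn", KNeighborsRegressor())])

# cross-validate over the tuning parameter k, then refit on the full data
grid = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [1, 2, 3]}, cv=4)
grid.fit(df, y)
print(grid.best_params_, grid.best_score_)
```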

Learning Objectives

  • Understand how model flexibility relates to the bias-variance tradeoff and thus model performance.
  • Tune models by manipulating their flexibility through the use of a tuning parameter to find a model that generalizes well.
  • Avoid overfitting by selecting a model of appropriate flexibility through the use of a validation set or cross-validation.
  • Use sklearn features such as Pipeline, ColumnTransformer, SimpleImputer, StandardScaler, OneHotEncoder and others to perform reproducible preprocessing.
  • Use GridSearchCV to tune models (select appropriate values of tuning parameters) with cross-validation.

Slides, Scribbles, and Readings

Module 03

In Module 03 we will introduce another nonparametric method for supervised learning: decision trees.

Topics

  • Regression Trees
  • Classification Trees
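
As a quick illustration (simulated data, not course materials), the sketch below fits both a regression tree and a classification tree, using max_depth as the tuning parameter:

```python
# A minimal sketch with simulated toy data: decision trees for regression and
# classification, with max_depth limiting flexibility to avoid overfitting.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y_reg = np.sin(X).ravel() + rng.normal(scale=0.2, size=100)
y_clf = (y_reg > 0).astype(int)

# a regression tree: each split thresholds a feature; leaves predict the mean
tree_reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)
print(tree_reg.predict([[2.5]]))

# a classification tree: leaf class proportions serve as estimated probabilities
tree_clf = DecisionTreeClassifier(max_depth=3).fit(X, y_clf)
print(tree_clf.predict_proba([[2.5]]))
```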

Learning Objectives

  • Understand how decision trees differ from KNN when determining closeness of data.
  • Find and evaluate decision tree splits for regression.
  • Find and evaluate decision tree splits for classification.
  • Use decision trees to make predictions for regression tasks using sklearn.
  • Use decision trees to make predictions for classification tasks using sklearn.
  • Use decision trees to estimate conditional probabilities for classification tasks using sklearn.
  • Tune the parameters of decision trees to avoid overfitting.

Slides, Scribbles, and Readings

Quiz 02

Quiz 02 will introduce linear methods for classification and regression, which will also present an opportunity to differentiate parametric and nonparametric methods. Then, we’ll modify existing methods that we have seen through the use of regularization and ensembles. Lastly, we’ll spend some time thinking about the specifics of evaluating binary classification, and some other miscellaneous practical concerns.

Module 04

In Module 04 we will introduce linear models for classification and regression, both parametric methods. We will begin to compare and contrast parametric and nonparametric methods.

Topics

  • Linear Models
    • Linear Regression
    • Logistic Regression
  • Parametric versus Nonparametric Models
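
A minimal sketch of these models on simulated data (not course materials) follows; note that adding polynomial terms is a preprocessing step, so the fitted models remain linear in their coefficients:

```python
# A minimal sketch with simulated toy data: parametric linear models,
# including polynomial terms added as a preprocessing step.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 1))
y_reg = 1 + 2 * X.ravel() - 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=150)
y_clf = (y_reg > y_reg.mean()).astype(int)

# linear regression stays "linear" in its learned coefficients, even with
# polynomial features of x
poly_reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_reg.fit(X, y_reg)
print(poly_reg.predict([[1.5]]))

# logistic regression: a linear model for estimating conditional probabilities
log_reg = LogisticRegression().fit(X, y_clf)
print(log_reg.predict_proba([[1.5]]))  # estimated P(y=0 | x) and P(y=1 | x)
```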

Learning Objectives

  • Differentiate between parametric and nonparametric regression.
  • Use sklearn to fit linear regression models and make predictions for unseen data.
  • Estimate conditional probabilities with logistic regression.
  • Use sklearn to fit logistic regression models and make predictions for unseen data.
  • Preprocess data to add polynomial and interaction terms for use in linear models.
  • Understand what makes linear models linear and how both linear regression and logistic regression are linear models.

Slides, Scribbles, and Readings

Module 05

In Module 05 we will modify previously seen methods to potentially improve their performance. We’ll look at ensembles of trees and add regularization to regression.

Topics

  • Ensemble Methods
    • Random Forests
    • Boosted Models
  • Regularization
    • Lasso
    • Ridge
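
The sketch below (simulated data, not course materials) illustrates these modifications for the regression task; the classification analogues follow the same pattern:

```python
# A minimal sketch with simulated toy data: regularized linear models and
# tree ensembles, here all for regression.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 features matter

# ridge shrinks coefficients toward zero; lasso can set some exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 2))  # irrelevant features tend to get coefficient 0

# a random forest averages many randomized trees; boosting fits trees sequentially
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
boost = HistGradientBoostingRegressor(random_state=0).fit(X, y)
print(rf.predict(X[:1]), boost.predict(X[:1]))
```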

Learning Objectives

  • Understand how the ridge and lasso constraints lead to shrunken and sparse estimates.
  • Use ridge regularization to perform regression and classification.
  • Use lasso regularization to perform regression and classification.
  • Understand how averaging the predictions from many trees (for example a random forest) can improve model performance.
  • Use a random forest to perform regression and classification.
  • Use boosting to perform regression and classification.

Slides, Scribbles, and Readings

Module 06

In Module 06 we will discuss binary classification in depth, in particular, metrics for evaluating binary classification models.

Topics

  • Binary Classification
  • Model Evaluation
  • Practical Considerations
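
As a small worked illustration (made-up predictions, not course materials), the sketch below recovers the confusion matrix counts and two common metrics built from them:

```python
# A minimal sketch with made-up labels: binary classification metrics built
# from the entries of the confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])

# for binary labels, ravel() returns counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")

# precision = TP / (TP + FP); recall = TP / (TP + FN)
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
```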

Learning Objectives

  • Understand the definitions of false positives, false negatives, and related metrics.
  • Calculate metrics specific to binary classification.

Slides, Scribbles, and Readings

Quiz 03

Quiz 03 will wrap up our discussion of supervised learning with an introduction to generative models. Then, we’ll take a brief detour to talk about unsupervised learning before diving into neural networks.

Module 07

In Module 07 we will introduce generative models for classification, with an emphasis on Naive Bayes.

Topics

  • Generative Models
    • Naive Bayes
    • Linear Discriminant Analysis (LDA)
    • Quadratic Discriminant Analysis (QDA)
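
A minimal illustration with simulated data (not course materials):

```python
# A minimal sketch with simulated toy data: Gaussian Naive Bayes models each
# feature's distribution within each class, then classifies via Bayes' rule.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# generate data from the classes: class 0 centered at 0, class 1 centered at 2
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.repeat([0, 1], 100)

nb = GaussianNB().fit(X, y)
print(nb.predict_proba([[1.0, 1.0]]))  # estimated P(y | x) from Bayes' rule
```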

Learning Objectives

  • Understand the difference between discriminative and generative models.
  • Use Naive Bayes models for the classification task.

Slides, Scribbles, and Readings

Module 08

In Module 08 we will introduce unsupervised learning. We will look at a variety of methods for the various subtasks: dimension reduction, clustering, density estimation, and outlier detection.

Topics

  • Unsupervised Learning
    • Dimension Reduction
      • PCA
    • Clustering
      • k-Means
      • Agglomerative Clustering
      • DBSCAN
    • Density Estimation
      • Kernel Density Estimation
      • Gaussian Mixture Models
    • Outlier Detection
      • One-Class SVM
      • Isolation Forest
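
As a small illustration of two of these subtasks (simulated data, not course materials):

```python
# A minimal sketch with simulated toy data: two unsupervised subtasks, with
# no labels used anywhere.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# two groups of points in 5 dimensions
X = np.vstack([rng.normal(0, 1, size=(100, 5)), rng.normal(4, 1, size=(100, 5))])

# dimension reduction: project onto the 2 directions of greatest variance
X_2d = PCA(n_components=2).fit_transform(X)

# clustering: assign each point to the nearest of k learned cluster centers
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)
print(np.bincount(labels))  # number of points assigned to each cluster
```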

Learning Objectives

  • Understand the difference between supervised and unsupervised machine learning tasks.
  • Identify supervised and unsupervised machine learning tasks.
  • Understand and identify unsupervised learning subtasks: dimension reduction, clustering, density estimation, and outlier detection.
  • Use principal components analysis (PCA) for dimension reduction.
  • Use k-means and other methods for clustering.
  • Use kernel density estimation and mixture models for density estimation.
  • Use one-class SVM and isolation forest for outlier detection.

Slides, Scribbles, and Readings

Module 09

Miscellaneous topics and recap.

Conclusion

To conclude the course, we will take a brief look at neural networks and so-called deep learning.

Module 10

In Module 10 we will introduce neural networks and deep learning using PyTorch.

Topics

  • Neural Networks
  • Deep Learning
  • PyTorch
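
A minimal training-loop sketch with simulated data (not course materials):

```python
# A minimal sketch with simulated toy data: a small fully connected network
# trained for regression with PyTorch.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.rand(200, 1) * 6                    # toy inputs
y = torch.sin(X) + 0.1 * torch.randn(200, 1)  # noisy targets

# one hidden layer with a ReLU nonlinearity
model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(500):
    optimizer.zero_grad()        # clear accumulated gradients
    loss = loss_fn(model(X), y)  # forward pass
    loss.backward()              # backpropagation
    optimizer.step()             # parameter update

print(f"final training MSE: {loss.item():.4f}")
```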

Learning Objectives

  • Train neural networks using PyTorch.
  • Evaluate neural networks using PyTorch.

Slides, Scribbles, and Readings

Additional Reading and Resources