Course Content
CS 307 content will largely be structured around three quizzes and each quiz will correspond to three modules of content. Additionally, we will have an introductory and concluding module.
Introduction
The first group of modules (containing only one module) will serve as an introduction to CS 307.
Module 00
In this module, you will become familiar with the course and we will get your machine set up to complete homework and labs. We will then give an overview of the fundamental machine learning tasks, introduce two very basic methods, and define basic metrics for assessing supervised learning methods.
Topics
- Machine Learning Tasks
  - Supervised Learning
    - Classification
    - Regression
  - Unsupervised Learning
    - Density Estimation
    - Clustering
    - Novelty and Outlier Detection
    - Dimension Reduction
  - Reinforcement Learning
- Baseline Methods
  - DummyClassifier
  - DummyRegressor
- Supervised Learning Metrics
  - Regression
    - Root Mean Square Error (RMSE)
    - Mean Absolute Error (MAE)
    - Mean Absolute Percentage Error (MAPE)
    - Coefficient of Determination (\(R^2\))
    - Max Error
  - Classification
    - Accuracy
    - Misclassification
Learning Objectives
After completing this module, you are expected to be able to:
- Understand the syllabus of the course.
- Understand the objectives of the course.
- Communicate with the course staff.
- Use Python, Jupyter, and VSCode to produce code for labs, quizzes, and MPs.
- Use PrairieLearn to complete homework, lab models, and MPs.
- Use Canvas to complete lab reports.
- Differentiate between supervised, unsupervised, and reinforcement learning.
- Identify regression and classification tasks.
- Use sklearn baseline models DummyClassifier and DummyRegressor.
- Calculate metrics to evaluate predictions from regression and classification methods.
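As a rough illustration of the baseline models and metrics listed above, the following sketch fits DummyRegressor and DummyClassifier and scores their predictions. The toy data and the chosen strategies ("mean" and "most_frequent") are assumptions made for this example, not course material.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Toy data, made up for illustration; dummy models ignore the features.
X = np.zeros((6, 1))
y_reg = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 9.0])
y_clf = np.array(["a", "a", "a", "a", "b", "b"])

# Regression baseline: always predict the mean of the training targets.
reg = DummyRegressor(strategy="mean").fit(X, y_reg)
reg_pred = reg.predict(X)
rmse = float(np.sqrt(mean_squared_error(y_reg, reg_pred)))
mae = float(mean_absolute_error(y_reg, reg_pred))

# Classification baseline: always predict the most frequent training class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y_clf)
clf_pred = clf.predict(X)
accuracy = float(accuracy_score(y_clf, clf_pred))
misclassification = 1.0 - accuracy

print(f"RMSE={rmse:.3f}, MAE={mae:.3f}")
print(f"accuracy={accuracy:.3f}, misclassification={misclassification:.3f}")
```

Any real method should beat these numbers; if it does not, that is usually a sign of a problem in the data or the modeling pipeline.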
Slides, Scribbles, and Readings
Quiz 01
Quiz 01 will focus on two foundational nonparametric supervised learning methods: k-nearest neighbors (KNN) and decision trees. Each method can be used for both classification and regression tasks. While discussing these methods, we will introduce the notion of generalization and tools for model selection such as cross-validation.
Module 01
In Module 01 we will begin discussing supervised learning, both the regression and classification tasks. We will look at one of the foundational methods of machine learning: k-nearest neighbors. We will also introduce data splitting and overfitting.
Topics
- K-Nearest Neighbors (KNN) Regression
  - KNeighborsRegressor
- K-Nearest Neighbors (KNN) Classification
  - KNeighborsClassifier
- Overfitting
- Train, Test, and Validation Datasets
  - train_test_split
- Object-Oriented Programming (OOP) in Python
Learning Objectives
- Differentiate between regression and classification tasks.
- Use k-nearest neighbors to make predictions for pre-processed data.
- Understand how conditional probabilities relate to classifications.
- Estimate and calculate conditional probabilities.
- Use k-nearest neighbors to estimate conditional probabilities.
- Split data into train, validation, and test sets.
- Modify a tuning parameter to control the flexibility of a model.
- Avoid overfitting by selecting a model through the use of a validation set.
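The objectives above can be sketched in a few lines: split the data into train, validation, and test sets, vary the tuning parameter k, and pick the value that does best on the validation set. The synthetic dataset, the candidate values of k, and the split proportions are all assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, already numeric data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=4, random_state=42)

# Hold out a test set first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Smaller k means a more flexible model; compare candidates on validation data.
val_accuracy = {}
for k in [1, 5, 15, 45]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_accuracy[k] = model.score(X_val, y_val)

best_k = max(val_accuracy, key=val_accuracy.get)

# Refit on train + validation with the chosen k, then evaluate once on test data.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
test_accuracy = final_model.score(X_test, y_test)

# Estimated conditional probabilities: fraction of the k neighbors in each class.
probs = final_model.predict_proba(X_test[:3])
print(best_k, round(test_accuracy, 3))
print(probs)
```

The test set is touched only once, after k has been chosen, so the reported test accuracy is not biased by the selection step.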
Slides, Scribbles, and Readings
Module 02
In Module 02 we will look at the bigger picture and focus on selecting tuning parameters and preprocessing data, especially heterogeneous data stored in pandas DataFrame objects. We’ll also discuss overfitting, generalization, and a concept related to both: the bias-variance tradeoff.
Topics
- Bias-Variance Tradeoff
- Generalization
- Cross-Validation
- Preprocessing
- sklearn API and Pipelines
Learning Objectives
- Understand how model flexibility relates to the bias-variance tradeoff and thus model performance.
- Tune models by manipulating their flexibility through the use of a tuning parameter to find a model that generalizes well.
- Avoid overfitting by selecting a model of appropriate flexibility through the use of a validation set or cross-validation.
- Use sklearn features such as Pipeline, ColumnTransformer, SimpleImputer, StandardScaler, OneHotEncoder, and others to perform reproducible preprocessing.
- Use GridSearchCV to tune models (select appropriate values of tuning parameters) with cross-validation.
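One possible way to combine the tools named above is sketched below: a ColumnTransformer preprocesses numeric and categorical columns differently, a Pipeline chains preprocessing with a KNN classifier, and GridSearchCV tunes k with cross-validation. The DataFrame, its column names ("height", "color"), and the parameter grid are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical heterogeneous data: one numeric and one categorical column.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "height": rng.normal(170, 10, n),
    "color": rng.choice(["red", "green", "blue"], n),
})
y = (df["height"] > 170).astype(int).to_numpy()
df.loc[df.index % 10 == 0, "height"] = np.nan  # inject some missing values

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["height"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])

# Preprocessing lives inside the pipeline, so it is re-fit within each CV fold.
pipe = Pipeline([("preprocess", preprocess), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": [1, 5, 15]}, cv=5)
search.fit(df, y)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing steps inside the pipeline means the imputer and scaler never see the held-out fold during fitting, which is what makes the cross-validated score an honest estimate.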
Slides, Scribbles, and Readings
Module 03
In Module 03 we will introduce another nonparametric method for supervised learning: decision trees.
Topics
- Regression Trees
- Classification Trees
Learning Objectives
- Understand how decision trees differ from KNN when determining closeness of data.
- Find and evaluate decision tree splits for regression.
- Find and evaluate decision tree splits for classification.
- Use decision trees to make predictions for regression tasks using sklearn.
- Use decision trees to make predictions for classification tasks using sklearn.
- Use decision trees to estimate conditional probabilities for classification tasks using sklearn.
- Tune the parameters of decision trees to avoid overfitting.
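To make "find and evaluate decision tree splits for regression" concrete, here is a minimal sketch for a single feature. It assumes the split criterion is the total sum of squared errors around each side's mean, which is one common choice; the helper names split_sse and best_split and the toy data are made up for this example.

```python
import numpy as np

def split_sse(x, y, threshold):
    """Total squared error when each side of the split predicts its own mean."""
    left, right = y[x < threshold], y[x >= threshold]
    return float(sum(((part - part.mean()) ** 2).sum()
                     for part in (left, right) if len(part) > 0))

def best_split(x, y):
    """Try midpoints between consecutive unique x values; return the best one."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    scores = [split_sse(x, y, t) for t in candidates]
    best = int(np.argmin(scores))
    return float(candidates[best]), scores[best]

# Toy data with an obvious jump between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

threshold, sse = best_split(x, y)
print(threshold, sse)  # 6.5 4.0
```

Unlike KNN, which measures closeness with a distance over all features at once, a tree defines neighborhoods one feature and one threshold at a time.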
Slides, Scribbles, and Readings
Quiz 02
Quiz 02 will introduce linear methods for classification and regression, which will also present an opportunity to differentiate parametric and nonparametric methods. Then, we’ll modify existing methods that we have seen through the use of regularization and ensembles. Lastly, we’ll spend some time thinking about the specifics of evaluating binary classification, and some other miscellaneous practical concerns.
Module 04
In Module 04 we will introduce linear models for classification and regression, both parametric methods. We will begin to compare and contrast parametric and nonparametric methods.
Topics
- Linear Models
  - Linear Regression
  - Logistic Regression
- Parametric versus Nonparametric Models
Learning Objectives
- Differentiate between parametric and nonparametric regression.
- Use sklearn to fit linear regression models and make predictions for unseen data.
- Estimate conditional probabilities with logistic regression.
- Use sklearn to fit logistic regression models and make predictions for unseen data.
- Preprocess data to add polynomial and interaction terms for use in linear models.
- Understand what makes linear models linear and how both linear regression and logistic regression are linear models.
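The sketch below illustrates two of these objectives under simple assumptions: a linear regression with added polynomial terms fit to noiseless quadratic data, and a logistic regression used to estimate conditional probabilities. The data, the polynomial degree, and the default LogisticRegression settings are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up noiseless data from y = 1 + 2x + 3x^2.
x = np.linspace(-2, 2, 21).reshape(-1, 1)
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2

# The model is still linear in its parameters; only the features are nonlinear.
poly_reg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_reg.fit(x, y)
pred = poly_reg.predict(np.array([[3.0]]))  # true value: 1 + 6 + 27 = 34

# Logistic regression: a linear model for the log-odds of the positive class.
labels = (x[:, 0] > 0).astype(int)
log_reg = LogisticRegression().fit(x, labels)
probs = log_reg.predict_proba(np.array([[-2.0], [2.0]]))  # columns: P(0), P(1)

print(round(float(pred[0]), 3))
print(probs.round(3))
```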
Slides, Scribbles, and Readings
Module 05
In Module 05 we will modify previously seen methods to potentially improve their performance. We’ll look at ensembles of trees and add regularization to regression.
Topics
- Ensemble Methods
  - Random Forests
  - Boosted Models
- Regularization
  - Lasso
  - Ridge
Learning Objectives
- Understand how the ridge and lasso constraints lead to shrunken and sparse estimates.
- Use ridge regression to perform regression and classification.
- Use lasso to perform regression and classification.
- Understand how averaging the predictions from many trees (for example a random forest) can improve model performance.
- Use a random forest to perform regression and classification.
- Use boosting to perform regression and classification.
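The contrast between shrunken and sparse estimates can be seen in a small simulation. In this sketch only the first of five features affects the response; the simulated data and the penalty strengths (alpha=10.0 for ridge, alpha=0.5 for lasso) are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Simulated data: only feature 0 matters (an assumption for this example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks all coefficients toward zero but typically keeps them nonzero.
print("OLS  :", ols.coef_.round(3))
print("Ridge:", ridge.coef_.round(3))
# Lasso can set coefficients exactly to zero, giving a sparse model.
print("Lasso:", lasso.coef_.round(3))
```

In practice the penalty strength is itself a tuning parameter, usually chosen with cross-validation rather than fixed by hand as it is here.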
Slides, Scribbles, and Readings
Module 06
In Module 06 we will discuss binary classification in depth, in particular, metrics for evaluating binary classification models.
Topics
- Binary Classification
- Model Evaluation
- Practical Considerations
Learning Objectives
- Understand the definitions of false positives, false negatives, and related metrics.
- Calculate metrics specific to binary classification.
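As a small worked example of these definitions, the sketch below counts true and false positives and negatives by hand and derives a few common metrics from them. It assumes the label 1 marks the positive class; the function name binary_metrics and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics, treating 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,       # sensitivity
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false negative, 1 false positive, and 5 true negatives, so accuracy is 0.8 while precision and recall are both 0.75.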
Slides, Scribbles, and Readings
Quiz 03
Quiz 03 will wrap up our discussion of supervised learning with an introduction to generative models. Then, we’ll take a brief detour to talk about unsupervised learning before diving into neural networks.
Module 07
In Module 07 we will introduce generative models for classification, with an emphasis on Naive Bayes.
Topics
- Generative Models
  - Naive Bayes
  - Linear Discriminant Analysis (LDA)
  - Quadratic Discriminant Analysis (QDA)
Learning Objectives
- Understand the difference between discriminative and generative models.
- Use Naive Bayes models for the classification task.
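A minimal sketch of Naive Bayes classification follows, using sklearn's GaussianNB as one possible variant. The two simulated Gaussian classes are an assumption for illustration. The fitted attributes hint at what makes the model generative: it estimates class priors and per-class feature distributions, then classifies via Bayes' rule.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two simulated classes centered at (-2, -2) and (2, 2).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=2.0, scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)

# Generative pieces: estimated class priors and per-class feature means.
print("priors:", model.class_prior_)
print("class means:\n", model.theta_.round(2))

# Classification combines the pieces with Bayes' rule.
new_points = np.array([[-2.0, -2.0], [2.0, 2.0]])
pred = model.predict(new_points)
probs = model.predict_proba(new_points)
print(pred, probs.round(3))
```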
Slides, Scribbles, and Readings
Module 08
In Module 08 we will introduce unsupervised learning. We will look at a variety of methods for the various subtasks: dimension reduction, clustering, density estimation, and outlier detection.
Topics
- Unsupervised Learning
  - Dimension Reduction
    - PCA
  - Clustering
    - k-Means
    - Agglomerative Clustering
    - DBSCAN
  - Density Estimation
    - Kernel Density Estimation
    - Gaussian Mixture Models
  - Outlier Detection
    - One-Class SVM
    - Isolation Forest
Learning Objectives
- Understand the difference between supervised and unsupervised machine learning tasks.
- Identify supervised and unsupervised machine learning tasks.
- Understand and identify unsupervised learning subtasks: dimension reduction, clustering, density estimation, and outlier detection.
- Use principal components analysis (PCA) for dimension reduction.
- Use k-means and other methods for clustering.
- Use kernel density estimation and mixture models for density estimation.
- Use one-class SVM and isolation forest for outlier detection.
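Two of the subtasks above, dimension reduction and clustering, can be sketched together with sklearn. The simulated five-dimensional blobs, the choice of two principal components, and the choice of two clusters are assumptions made for this example; note that no labels are used when fitting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two well-separated simulated blobs in five dimensions.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(40, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(40, 5))
X = np.vstack([blob_a, blob_b])

# Dimension reduction: project onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Clustering: group the projected points into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)
print(labels)
```

The cluster labels are arbitrary integers, so only the grouping is meaningful, not which blob is called 0 or 1.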
Slides, Scribbles, and Readings
Module 09
Miscellaneous topics and recap.
Conclusion
To conclude the course, we will take a brief look at neural networks and so-called deep learning.
Module 10
In Module 10 we will introduce neural networks and deep learning using PyTorch.
Topics
- Neural Networks
- Deep Learning
- PyTorch
Learning Objectives
- Train neural networks using PyTorch.
- Evaluate neural networks using PyTorch.
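A bare-bones sketch of training and evaluating a small network in PyTorch is shown below. The architecture, optimizer, learning rate, number of epochs, and synthetic data are all assumptions chosen to keep the example short; they are not taken from the course materials.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data from y = 2x + 1, split into train and test halves.
X = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * X + 1
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

# A small fully connected network with one hidden layer.
model = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

def evaluate(inputs, targets):
    """Mean squared error without tracking gradients."""
    model.eval()
    with torch.no_grad():
        return loss_fn(model(inputs), targets).item()

initial_test_loss = evaluate(X_test, y_test)

# Training loop: forward pass, loss, backward pass, parameter update.
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

final_test_loss = evaluate(X_test, y_test)
print(f"test MSE before: {initial_test_loss:.4f}, after: {final_test_loss:.4f}")
```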