```python
# basics
import pandas as pd
import numpy as np

# data
from sklearn.datasets import load_breast_cancer

# machine learning
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
```
# The Whole Game: Using `sklearn` for Supervised Learning

This document will serve as a high-level `sklearn` tutorial and guide for the supervised learning task.
## Loading Data

To perform supervised learning in `sklearn`, data must first be loaded into Python. There are many, many ways to do so. One of the most common tools to load data into Python for data science is the input-output functionality of `pandas`. Two specific functions that we have seen and used are `read_csv` and `read_parquet`.
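As a quick reminder of how these are used, here is a minimal sketch; the file paths are hypothetical placeholders, not files referenced elsewhere in this document.

```python
import pandas as pd

# read a comma-separated values file into a DataFrame (hypothetical path)
full_data = pd.read_csv("path/to/full_data.csv")

# read a parquet file into a DataFrame (hypothetical path)
full_data = pd.read_parquet("path/to/full_data.parquet")
```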
## Data Splitting

After data has been loaded, and before any inspection or analysis, it should be train-test split to avoid data leakage. In `sklearn`, the `train_test_split` function is available to perform this task.

```python
# train-test split
train_data, test_data = train_test_split(
    full_data,
    test_size=0.20,
    random_state=42,
    stratify=full_data["Target"],
)
```
The above example assumes that `full_data` contains both the features and target, in this case with a target named `Target`.

- The `test_size` parameter controls how much of the data is withheld for the test set.
- The `random_state` parameter fixes the randomization for reproducibility, as the splitting is otherwise done at random.
- The `stratify` parameter is useful for splitting in classification tasks, especially those that may have severe class imbalance.
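To see what `stratify` provides, one quick check is to compare class proportions across the splits. This is a minimal sketch, assuming the `full_data`, `train_data`, and `test_data` objects from the example above.

```python
# class proportions in the full data
print(full_data["Target"].value_counts(normalize=True))

# with stratify=full_data["Target"], the train and test proportions
# should closely match the proportions in the full data
print(train_data["Target"].value_counts(normalize=True))
print(test_data["Target"].value_counts(normalize=True))
```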
If the initial data contains columns for both the features and the target, it is necessary to further separate the data into an object that contains only the features (often called `X`) and an object that contains only the target variable (often called `y`). This should be done to both the train and test data.

```python
# create X and y for train
X_train = train_data.drop(columns="Target")
y_train = train_data["Target"]

# create X and y for test
X_test = test_data.drop(columns="Target")
y_test = test_data["Target"]
```
Sometimes, data can be loaded pre-separated into features (`X`) and target (`y`) objects. In that case, the splitting code looks slightly different, and it becomes important to be aware of the order of the objects returned.

```python
# train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
)
```
## Available Models

There are many potential models that could be fit to the train data. Let's create a rough categorization of the models that we have seen so far.

**Baseline Models**

- Dummy Models

**Basic Models**

- \(k\)-Nearest Neighbors
- Decision Tree
- Linear Models

**Regularized Models**

- Lasso
- Ridge

**Ensembles**

- Random Forest
- Boosted Models

These are simply what we have seen, but there are many, many more potential models we could fit! However, this set of models will serve you well, as they are practically useful. More importantly, if you understand how to work with these models, you can easily work with any model available in `sklearn`.
Each of the above can be used for either regression (often containing `Regressor` in its name) or classification (often containing `Classifier` in its name) by using the specific class within `sklearn` for the desired task.[^1]
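For instance, several of the models above come as paired classes whose names differ only in the task suffix. The following import sketch shows the pattern; none of these are used directly in this document beyond the classifiers already imported.

```python
# regression and classification variants of the same underlying methods
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
```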
When comparing and contrasting these models, there are several questions you should ask:
- What are the available tuning parameters?
- What is the relationship of these parameters to the model’s flexibility?
- Which parameters are most useful to tune?
- Is the model a strong or a weak learner?
- Can a parameter be used to make the model strong or weak?
- Can the model learn nonlinear relationships and interaction?
- Is the model sensitive to the scaling of input features?
- Is the model fast to train?
- Is the model fast to predict?
- What is learned and thus required to be stored?
- What is the size of the model when stored?
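Some of these questions, particularly those about training speed, prediction speed, and stored size, can be explored empirically. The following is a rough sketch of one way to do so, not an `sklearn`-prescribed workflow; it assumes the `X_train`, `y_train`, and `X_test` objects created later in this document.

```python
import pickle
import time

model = RandomForestClassifier(random_state=42)

# how long does training take?
start = time.perf_counter()
model.fit(X_train, y_train)
fit_seconds = time.perf_counter() - start

# how long do predictions take?
start = time.perf_counter()
model.predict(X_test)
predict_seconds = time.perf_counter() - start

# roughly how large is the fitted model when stored?
stored_bytes = len(pickle.dumps(model))

print(f"fit: {fit_seconds:.3f}s, predict: {predict_seconds:.3f}s, stored: {stored_bytes} bytes")
```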
## Model Fitting

The beauty of `sklearn` is its API design, and thus the similarity of usage of these models. Machine learning models in `sklearn` are implemented as classes. Thus, generally before fitting a model, it first must be instantiated (created).

```python
wine = load_breast_cancer(as_frame=True).frame
wine
```
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | 0 |
565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | 0 |
566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | 0 |
567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | 0 |
568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | 0.1587 | 0.05884 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 | 1 |
569 rows × 31 columns
```python
# train-test split
wine_train, wine_test = train_test_split(
    wine,
    test_size=0.20,
    random_state=42,
    stratify=wine["target"],
)

# create X and y for train
X_train = wine_train.drop(columns="target")
y_train = wine_train["target"]

# create X and y for test
X_test = wine_test.drop(columns="target")
y_test = wine_test["target"]
```
Let's first use a random forest as an example. To quickly obtain a list of the available parameters, and their default values, we can use the `get_params` method.

```python
RandomForestClassifier().get_params()
```
```
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': 'sqrt',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'monotonic_cst': None,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
 'warm_start': False}
```
Note that what we're really doing here is first instantiating an instance of the random forest class with `RandomForestClassifier()`, then obtaining the parameters and the values used to initialize the random forest. It just so happens that we used all the default parameter values.

Instead of using the default values, let's modify a couple, and give this instance a name, `rf`.
```python
rf = RandomForestClassifier(
    n_estimators=25,
    max_depth=10,
    random_state=42,
)
```
We can again use `get_params` to check the parameter values, this time verifying that `n_estimators`, `max_depth`, and `random_state` were set correctly.

```python
rf.get_params()
```
```
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': 10,
'max_features': 'sqrt',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'monotonic_cst': None,
'n_estimators': 25,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
 'warm_start': False}
```
In addition to `get_params`, there are several other common and important methods for `sklearn` model classes.

- `get_params`
- `fit`
- `predict`
- `predict_proba`[^2]
- `score`
Importantly, the `fit` method must be called before any of `predict`, `predict_proba`, or `score`. What happens if a model is not fit before calling these other methods?

```python
rf.predict(X_test)
```

```
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 rf.predict(X_test)

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/sklearn/ensemble/_forest.py:904, in ForestClassifier.predict(self, X)
    883 def predict(self, X):
    884     """
    885     Predict class for X.
    886     (...)
    902     The predicted classes.
    903     """
--> 904 proba = self.predict_proba(X)
    906 if self.n_outputs_ == 1:
    907     return self.classes_.take(np.argmax(proba, axis=1), axis=0)

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/sklearn/ensemble/_forest.py:944, in ForestClassifier.predict_proba(self, X)
    922 def predict_proba(self, X):
    923     """
    924     Predict class probabilities for X.
    925     (...)
    942     classes corresponds to that in the attribute :term:`classes_`.
    943     """
--> 944 check_is_fitted(self)
    945 # Check data
    946 X = self._validate_X_predict(X)

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/sklearn/utils/validation.py:1757, in check_is_fitted(estimator, attributes, msg, all_or_any)
   1754     return
   1756 if not _is_fitted(estimator, attributes, all_or_any):
-> 1757     raise NotFittedError(msg % {"name": type(estimator).__name__})

NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
```
Oh no! An error. While the entire error message is lengthy, as always, you should read the error message from the bottom to the top. Read the last line. Well, look at that! This error message is excellent, and in rather plain terms, tells us the issue and how to fix it.

Let's actually fit this model, which requires supplying an `X` and `y`.

```python
_ = rf.fit(X_train, y_train)
```
We assign the output the name `_` to suppress the output, as the important information is stored in the class itself. If you'd like to see the output, simply run `rf`.

```python
rf
```
```
RandomForestClassifier(max_depth=10, n_estimators=25, random_state=42)
```
Now that we've fit the model, we can use the other methods.

```python
rf.predict(X_test)
```
```
array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 1])
```
The `predict` method takes an `X` as input, and outputs the predicted target for each sample of `X`.
```python
rf.score(X_test, y_test)
```

```
0.956140350877193
```
The `score` method takes both an `X` and `y` and then "scores" the result. How does it do this scoring? It depends on the model, but generally for classification it provides accuracy and for regression it uses \(R^2\).[^3] We generally suggest ignoring the `score` method. Instead, we recommend using the metric functions discussed later.
```python
rf.predict_proba(X_test[:10])
```
```
array([[1.  , 0.  ],
       [0.  , 1.  ],
       [0.88, 0.12],
       [0.68, 0.32],
       [0.96, 0.04],
       [0.04, 0.96],
       [0.04, 0.96],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ]])
```
Lastly, for most classification methods, we can use `predict_proba` to obtain the estimated conditional probability for each category of the target, for each sample provided via `X`.
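The connection between `predict_proba` and `predict` is worth seeing directly. The following is a minimal sketch, assuming the fitted `rf` from above; for this binary 0/1 target, `predict` corresponds to choosing the higher-probability class, which here matches a 0.5 cutoff on the positive-class probability.

```python
# probability of class 1 (second column) for the first 10 test samples
prob_positive = rf.predict_proba(X_test[:10])[:, 1]

# thresholding those probabilities at 0.5 recovers the predicted classes
manual_predictions = (prob_positive >= 0.5).astype(int)

print(manual_predictions)
print(rf.predict(X_test[:10]))
```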
While we used a random forest as an example, except for setting `n_estimators` and `max_depth`, nothing we've done here is specific to random forests, and applies to all machine learning methods available in `sklearn`.
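As a quick illustration of this uniformity, here is a sketch of the exact same instantiate-fit-predict pattern with a different estimator, the `DecisionTreeClassifier` imported at the top of this document; the particular parameter values are arbitrary choices for illustration.

```python
# the same workflow, unchanged, with a different model class
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
dt.predict(X_test)
dt.predict_proba(X_test[:10])
dt.score(X_test, y_test)
```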
## Metrics and Evaluation

Once models have been fit, they need to be evaluated. While model fitting is often more interesting to study, model evaluation is at least as important, if not more important. Choosing the appropriate evaluation strategy can make or break a model's effectiveness in practice.

Within `sklearn`, many potential metrics for evaluation are implemented. The User Guide provides an overview of the available metrics, while highlighting two ways the metrics can be used. The table in section 3.4.1.1 groups the metrics by their associated task, one of: classification, clustering, and regression. The Scoring and Function columns express how to utilize each metric for scoring or as a function.
The function variant computes the metric given input data, which usually includes the truth (`y_true`) and the predicted values (`y_pred`). The scoring variant, which is simply a string, is used to define which metric will be used to evaluate models during cross-validation procedures such as `GridSearchCV`.

Let's focus on the function version for now, and we'll return to the scoring variant when we discuss tuning.
To demonstrate, we'll first need some predictions from our learned model.

```python
y_pred = rf.predict(X_test)
```
The parameters of the scoring functions follow a pattern: the true values first, the predicted values second. We reinforce this by first demonstrating the use of `f1_score` while naming the parameters.

```python
f1_score(y_true=y_test, y_pred=y_pred)
```

```
0.9655172413793104
```
However, in practice, the parameter names are usually suppressed.

```python
f1_score(y_test, y_pred)
```

```
0.9655172413793104
```
We note this because some metrics happen to be unaffected by the order in which the true and predicted values are supplied, but this is not true in general: the order does matter.
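To make that concrete, here is a small sketch using `recall_score` (an extra import, not used elsewhere in this document); swapping the argument order silently computes a different quantity.

```python
from sklearn.metrics import recall_score

# correct order: true values first, predictions second
recall_score(y_test, y_pred)

# swapped order: the roles of "truth" and "prediction" are reversed,
# so this instead computes the precision of the original predictions
recall_score(y_pred, y_test)
```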
## Tuning and Searching

The most common approach to tuning a model in `sklearn` is a combination of cross-validation and a grid search.

According to the above user guide, a search consists of:

- an estimator (regressor or classifier such as `sklearn.svm.SVC()`);
- a parameter space;
- a method for searching or sampling candidates;
- a cross-validation scheme; and
- a score function.
These components are neatly combined via the `GridSearchCV` class.

Let's look at an example.

First, we'll pick an estimator. Here, we've chosen a decision tree classifier.

```python
dtc = DecisionTreeClassifier(random_state=42)
```
Note that we are setting `random_state=42`. While you certainly could tune this parameter, it would be a rather silly thing to do! Instead, we set this parameter, which will be fixed and used in the remainder of this example, to control the random elements of decision trees.
Next, we can specify the parameter space that we will search in. This effectively amounts to specifying the values of each parameter that will be considered.
```python
dtc_grid = {
    "max_depth": [1, 3, 5, 15, 25, None],
    "splitter": ["best", "random"],
}
```
Within `GridSearchCV`, a "grid" of these values will be considered. In this case, that would be trying both the `"best"` and `"random"` splitter for each value of `max_depth`. To see the fully expanded grid, we can use `ParameterGrid`, which `GridSearchCV` uses internally.

```python
pd.DataFrame(ParameterGrid(dtc_grid))
```
| | max_depth | splitter |
---|---|---|
0 | 1.0 | best |
1 | 1.0 | random |
2 | 3.0 | best |
3 | 3.0 | random |
4 | 5.0 | best |
5 | 5.0 | random |
6 | 15.0 | best |
7 | 15.0 | random |
8 | 25.0 | best |
9 | 25.0 | random |
10 | NaN | best |
11 | NaN | random |
How should you decide which parameters to consider, and what values of those parameters to include in the grid? That's a difficult question to answer. For any specific model, usually a few key parameters are useful, like \(k\) for KNN and `max_depth` for tree-based models. The usefulness of many parameters across many models is really an intuition that must be developed through practice and experimentation. Remember, that intuition is at best a heuristic, and there are no "rules" about which parameters and values should be used. The magnitudes of the values of the parameters are an important consideration. (Think \(k\) in KNN versus \(\lambda\) for regularized linear models.)

If you're unsure where to start, start small. Try a "small" grid and iterate. The above example is reasonably small, but for its purpose, effective. Only move to a "large" grid if necessary, and only if you have the time. The bigger the grid, the more compute time needed!
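One way to keep a grid manageable is to count how many candidate models it implies before running anything. Here is a small sketch using the `ParameterGrid` utility imported above with the `dtc_grid` defined earlier; the fold count simply anticipates the cross-validation setting used in the next step.

```python
# number of parameter combinations in the grid
n_candidates = len(ParameterGrid(dtc_grid))

# each candidate is fit once per cross-validation fold (5 with cv=5)
print(n_candidates, "candidates,", n_candidates * 5, "total fits")
```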
Our method for searching will be an exhaustive grid search.[^4] That is, we will simply try each parameter combination included in the grid. This is exactly what `GridSearchCV` does.

```python
dtc_tuned = GridSearchCV(
    dtc,
    dtc_grid,
    cv=5,
    scoring=[
        "accuracy",
        "precision",
        "recall",
        "f1",
    ],
    refit="f1",
)
```
The first parameter (`estimator`), to which we pass `dtc`, specifies the estimator to be used.

The second parameter (`param_grid`), to which we pass `dtc_grid`, specifies the parameter space that will be searched.

The search method is implicit in the use of `GridSearchCV`.

The third parameter, `cv`, to which we pass `5`, specifies the cross-validation scheme. In this case, we are using the (default) value `5`, which specifies 5-fold cross-validation. In general, we highly recommend using `5` and not changing this value.
The fourth parameter, `scoring`, specifies the scoring to be used within `GridSearchCV`. That is, it specifies the metrics that will be cross-validated. The potential values here are those listed in the previously referenced Metrics and Scoring section of the `sklearn` User Guide, specifically in the "Scoring" column of Table 3.4.1.1. Here, we're providing a list of scoring methods; alternatively, you can provide a single scoring method.

Given that we provided a list of scoring methods, we have also supplied a value for the `refit` parameter, in this case `f1`. Doing so tells `GridSearchCV` which of the scoring methods to use when choosing the "best" model. In this case, the model with the best \(F_1\) score will be selected. If a single scoring method is supplied to `scoring`, that method will be used as the scoring method for `refit` without needing to specify it.
The `refit` parameter is a nice feature of `GridSearchCV`. It "refits" the best model found to the provided training data. It then allows the use of methods like `predict` and `predict_proba` on the object returned from `GridSearchCV` as if the model had been directly fit to the training data as we did above.
Before we can use `predict`, like models themselves, we first need to "fit" our `GridSearchCV` object with its `fit` method. Like before, we specify the `X` and `y` components of the training data.

```python
dtc_tuned.fit(X_train, y_train)
```
```
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=42), param_grid={'max_depth': [1, 3, 5, 15, 25, None], 'splitter': ['best', 'random']}, refit='f1', scoring=['accuracy', 'precision', 'recall', 'f1'])
```
To "see" the model selected, we can check the `best_estimator_` attribute.

```python
dtc_tuned.best_estimator_
```

```
DecisionTreeClassifier(max_depth=3, random_state=42, splitter='random')
```
To simply see the best parameter values that were selected, we can access the `best_params_` attribute.

```python
dtc_tuned.best_params_
```

```
{'max_depth': 3, 'splitter': 'random'}
```
The `best_estimator_` attribute contains a fitted version of the selected model.

```python
dtc_tuned.best_estimator_.predict(X_test)[:10]
```

```
array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0])
```
However, it is generally not necessary to access `best_estimator_`! Instead, you can simply use methods like `predict` on a `GridSearchCV` object that has been fit.

```python
dtc_tuned.predict(X_test)[:10]
```

```
array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0])
```
The `cv_results_` attribute collects details of the scoring, including the mean and standard deviation of each scoring method supplied.

```python
pd.DataFrame(dtc_tuned.cv_results_)
```
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_splitter | params | split0_test_accuracy | split1_test_accuracy | split2_test_accuracy | ... | std_test_recall | rank_test_recall | split0_test_f1 | split1_test_f1 | split2_test_f1 | split3_test_f1 | split4_test_f1 | mean_test_f1 | std_test_f1 | rank_test_f1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.003214 | 0.000299 | 0.006699 | 0.000136 | 1 | best | {'max_depth': 1, 'splitter': 'best'} | 0.912088 | 0.912088 | 0.868132 | ... | 0.023798 | 11 | 0.928571 | 0.929825 | 0.901639 | 0.902655 | 0.928571 | 0.918252 | 0.013162 | 11 |
1 | 0.001814 | 0.000016 | 0.006481 | 0.000064 | 1 | random | {'max_depth': 1, 'splitter': 'random'} | 0.890110 | 0.901099 | 0.901099 | ... | 0.039072 | 12 | 0.910714 | 0.915888 | 0.918919 | 0.897196 | 0.865385 | 0.901620 | 0.019586 | 12 |
2 | 0.005306 | 0.000020 | 0.006503 | 0.000019 | 3 | best | {'max_depth': 3, 'splitter': 'best'} | 0.934066 | 0.923077 | 0.901099 | ... | 0.021053 | 1 | 0.947368 | 0.938053 | 0.923077 | 0.925620 | 0.973913 | 0.941606 | 0.018376 | 7 |
3 | 0.001928 | 0.000010 | 0.006503 | 0.000108 | 3 | random | {'max_depth': 3, 'splitter': 'random'} | 0.967033 | 0.978022 | 0.879121 | ... | 0.037791 | 3 | 0.974359 | 0.982456 | 0.902655 | 0.938053 | 0.956522 | 0.950809 | 0.028532 | 1 |
4 | 0.006913 | 0.000233 | 0.006534 | 0.000035 | 5 | best | {'max_depth': 5, 'splitter': 'best'} | 0.945055 | 0.945055 | 0.901099 | ... | 0.017891 | 2 | 0.956522 | 0.955752 | 0.923077 | 0.929825 | 0.965517 | 0.946139 | 0.016576 | 3 |
5 | 0.002051 | 0.000041 | 0.006435 | 0.000039 | 5 | random | {'max_depth': 5, 'splitter': 'random'} | 0.923077 | 0.967033 | 0.901099 | ... | 0.023798 | 7 | 0.938053 | 0.973451 | 0.918919 | 0.947368 | 0.955752 | 0.946709 | 0.018136 | 2 |
6 | 0.007616 | 0.000772 | 0.006528 | 0.000030 | 15 | best | {'max_depth': 15, 'splitter': 'best'} | 0.912088 | 0.901099 | 0.901099 | ... | 0.024811 | 8 | 0.928571 | 0.918919 | 0.923077 | 0.913793 | 0.956522 | 0.928176 | 0.014981 | 8 |
7 | 0.002104 | 0.000055 | 0.006487 | 0.000102 | 15 | random | {'max_depth': 15, 'splitter': 'random'} | 0.912088 | 0.956044 | 0.923077 | ... | 0.026257 | 4 | 0.928571 | 0.964286 | 0.939130 | 0.957265 | 0.928571 | 0.943565 | 0.014740 | 4 |
8 | 0.007677 | 0.000787 | 0.006752 | 0.000405 | 25 | best | {'max_depth': 25, 'splitter': 'best'} | 0.912088 | 0.901099 | 0.901099 | ... | 0.024811 | 8 | 0.928571 | 0.918919 | 0.923077 | 0.913793 | 0.956522 | 0.928176 | 0.014981 | 8 |
9 | 0.002113 | 0.000063 | 0.006431 | 0.000017 | 25 | random | {'max_depth': 25, 'splitter': 'random'} | 0.912088 | 0.956044 | 0.923077 | ... | 0.026257 | 4 | 0.928571 | 0.964286 | 0.939130 | 0.957265 | 0.928571 | 0.943565 | 0.014740 | 4 |
10 | 0.007571 | 0.000787 | 0.006592 | 0.000125 | None | best | {'max_depth': None, 'splitter': 'best'} | 0.912088 | 0.901099 | 0.901099 | ... | 0.024811 | 8 | 0.928571 | 0.918919 | 0.923077 | 0.913793 | 0.956522 | 0.928176 | 0.014981 | 8 |
11 | 0.002171 | 0.000068 | 0.006445 | 0.000019 | None | random | {'max_depth': None, 'splitter': 'random'} | 0.912088 | 0.956044 | 0.923077 | ... | 0.026257 | 4 | 0.928571 | 0.964286 | 0.939130 | 0.957265 | 0.928571 | 0.943565 | 0.014740 | 4 |
12 rows × 39 columns
Here, we've wrapped the results using `pd.DataFrame()` to make the results more readable. Additionally, we can inspect specific columns to further increase readability.

```python
pd.DataFrame(dtc_tuned.cv_results_)[
    [
        "param_max_depth",
        "param_splitter",
        "mean_test_f1",
        "mean_test_accuracy",
    ]
]
```
| | param_max_depth | param_splitter | mean_test_f1 | mean_test_accuracy |
---|---|---|---|---|
0 | 1 | best | 0.918252 | 0.896703 |
1 | 1 | random | 0.901620 | 0.883516 |
2 | 3 | best | 0.941606 | 0.925275 |
3 | 3 | random | 0.950809 | 0.938462 |
4 | 5 | best | 0.946139 | 0.931868 |
5 | 5 | random | 0.946709 | 0.934066 |
6 | 15 | best | 0.928176 | 0.909890 |
7 | 15 | random | 0.943565 | 0.929670 |
8 | 25 | best | 0.928176 | 0.909890 |
9 | 25 | random | 0.943565 | 0.929670 |
10 | None | best | 0.928176 | 0.909890 |
11 | None | random | 0.943565 | 0.929670 |
Inspecting these results verifies that the highest \(F_1\) score is obtained with the `best_params_` values of the parameters.
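One way to perform that check directly is to sort the cross-validation results by their \(F_1\) rank; a small sketch:

```python
# the top row should match best_params_
pd.DataFrame(dtc_tuned.cv_results_)[
    ["param_max_depth", "param_splitter", "mean_test_f1", "rank_test_f1"]
].sort_values("rank_test_f1")
```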
To round out this example, we calculate the test accuracy and \(F_1\) score with the chosen model.

```python
y_pred = dtc_tuned.predict(X_test)
```

```python
accuracy_score(y_test, y_pred)
```

```
0.9385964912280702
```

```python
f1_score(y_test, y_pred)
```

```
0.951048951048951
```
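As noted in footnote 4, `sklearn` also provides `RandomizedSearchCV` for a random search. A minimal sketch of how it could be used with the same estimator is below; the parameter distributions and `n_iter` value are illustrative choices, and `scipy` is assumed to be available.

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

dtc_random = RandomizedSearchCV(
    dtc,
    param_distributions={
        "max_depth": randint(1, 30),  # sample integer depths rather than listing them
        "splitter": ["best", "random"],
    },
    n_iter=10,  # number of parameter settings sampled
    cv=5,
    scoring="f1",
    random_state=42,
)
dtc_random.fit(X_train, y_train)
dtc_random.best_params_
```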
## Footnotes

[^1]: A tricky exception is `LogisticRegression`, which is used for the classification task.

[^2]: The `predict_proba` method is only available for (some) classification methods.

[^3]: The documentation for each model class in `sklearn` will specify the scorer used.

[^4]: We have not explored alternatives here, but as an example, you could instead consider a random search across a grid, which `sklearn` can accomplish with `RandomizedSearchCV`.