When building a classifier, the dataset often has an imbalanced distribution of features, and some of those features can be very important for predicting the labels. For example, suppose a certain city's electorate is 47% female and 53% male, and the results of an election are a function of how male and female votes are distributed. When you sample data for a survey (or for training and testing a machine learning classifier), it is important to maintain this distribution in your sampled data so that it remains representative of the population. This is called “Stratified Sampling”.
Stratified Sampling is important because it ensures that your sample does not introduce this kind of bias and that it actually represents the population.
Is there an easy way to split the dataset into training and test sets while maintaining the composition of the key feature?
There are two modules provided by Scikit-learn for Stratified Splitting:
StratifiedKFold : This module generates n_splits folds of the dataset such that each fold preserves the proportion of samples from each class, so both the training and test sets stay balanced.
>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = StratifiedKFold(n_splits=2)
>>> skf.get_n_splits(X, y)
2
>>> print(skf)
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in skf.split(X, y):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]
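To see the stratification at work, we can count the classes in each fold. A minimal sketch, using a made-up imbalanced dataset (8 samples of class 0, 4 of class 1 — not from the example above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 8 samples of class 0, 4 of class 1 (a 2:1 ratio)
X = np.zeros((12, 2))
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X, y):
    # Each fold keeps the original 2:1 class ratio: 4 of class 0, 2 of class 1
    print("train class counts:", np.bincount(y[train_index]))
    print("test class counts: ", np.bincount(y[test_index]))
```

Every fold reports counts of [4 2] for both training and test indices, i.e. the 2:1 ratio of the full dataset is preserved on both sides of the split.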
StratifiedShuffleSplit : This module, on the other hand, creates n_splits randomized training/test splits in which the class proportions are preserved. This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, and it returns stratified randomized folds.
>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5
>>> print(sss)
StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
>>> for train_index, test_index in sss.split(X, y):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]
Stratification can also be achieved when splitting data with train_test_split, by passing the relevant parameter, stratify.
from sklearn.model_selection import train_test_split

train, test = train_test_split(X, test_size=0.2, stratify=X['YOUR_COLUMN_LABEL'])
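As a concrete sketch of the stratify parameter, here is a made-up DataFrame mirroring the 53/47 voter example from the introduction (the column name "gender" and the data are assumptions, not from the snippet above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical voter data: 53 male rows, 47 female rows
df = pd.DataFrame({
    "gender": ["M"] * 53 + ["F"] * 47,
    "age": range(100),
})

# Stratify on the gender column so both splits keep roughly the 53/47 ratio
train, test = train_test_split(
    df, test_size=0.2, stratify=df["gender"], random_state=0
)
print(train["gender"].value_counts(normalize=True))
print(test["gender"].value_counts(normalize=True))
```

Both the 80-row training set and the 20-row test set end up with close to 53% "M" and 47% "F", matching the full dataset.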
In a follow-up post, I will also add some experimental data to show how the distribution changes between random and stratified shuffle splits.
Note : Someone asked me recently how to handle continuous data. For continuous data, you should first do “binning” of the values and then apply stratification on the bins. I will try to cover this too in a follow-up post.
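As a quick preview of that idea, here is a minimal sketch of binning a continuous variable and then stratifying on the bins. The variable (a synthetic "income" column), the choice of quartile bins via pandas' qcut, and all parameter values are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical continuous variable, e.g. income
rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, size=200)

# Bin the continuous values into quartiles (bin labels 0-3), then stratify on the bins
bins = pd.qcut(income, q=4, labels=False)

train_idx, test_idx = train_test_split(
    np.arange(len(income)), test_size=0.25, stratify=bins, random_state=0
)
# Each quartile bin contributes roughly equally to the test set
print(np.bincount(bins[test_idx]))
```

With 200 samples split into 4 equal quartile bins, the 50-sample test set draws about 12-13 samples from each bin, so the shape of the income distribution is preserved across the split.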