How to make a subsample from a dataset with equal labels, like the sklearn digits dataset

For this demonstration, we'll use the digits dataset from sklearn. It consists of 10 classes, one for each digit from 0 to 9.

from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)
print(digits.target.shape)

The output looks like this:

(1797, 64)
(1797,)

So each digit class contains a number of samples. I would like to draw a subsample of each class from the dataset. For example, for each digit from 0 to 9, I need 50 samples of that class.

print(digits.data.shape)
print(digits.target.shape)

The result should be (50 samples * 10 classes = 500 samples):

(500, 64)
(500,)

The result should contain a subsample of every class available in the dataset.

CodePudding user response：

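One option is to use sklearn.model_selection.train_test_split and pass the labels (here: digits.target) as stratify, so the data is split in a stratified fashion. Note that stratification preserves the original class proportions, and the classes in this dataset are not perfectly balanced, so the counts come out close to 50 per class rather than exactly 50.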

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
import numpy as np

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    stratify=digits.target, 
                                                    train_size=500)

# check how evenly the labels were split (roughly, not exactly, 50 per class)
print(np.unique(y_train, return_counts=True))
# (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
# array([50, 51, 49, 51, 50, 51, 50, 50, 48, 50]))

print(X_train.shape)
# (500, 64)

print(y_train.shape)
# (500,)
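
If exactly 50 samples per class are required, one possible approach is to pick the row indices class by class with NumPy instead of relying on a stratified split. The following is only a minimal sketch: subsample_per_class is a hypothetical helper name (not part of sklearn), the default seed is arbitrary, and the demo uses small synthetic arrays so that it runs without downloading anything.

```python
import numpy as np

def subsample_per_class(X, y, n_per_class, seed=0):
    """Hypothetical helper: return exactly n_per_class random rows of X per label in y."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)  # row indices belonging to this class
        if idx.size < n_per_class:
            raise ValueError(f"class {label!r} has only {idx.size} samples")
        # sample without replacement so no row is selected twice
        picked.append(rng.choice(idx, size=n_per_class, replace=False))
    picked = np.concatenate(picked)
    rng.shuffle(picked)  # mix the classes; drop this line to keep them grouped
    return X[picked], y[picked]

# Synthetic stand-in data: 300 rows, 4 features, 3 unevenly sized classes
y_demo = np.repeat([0, 1, 2], [150, 100, 50])
X_demo = np.arange(300 * 4).reshape(300, 4)

X_sub, y_sub = subsample_per_class(X_demo, y_demo, n_per_class=20)
print(X_sub.shape, y_sub.shape)  # (60, 4) (60,)
print(np.unique(y_sub, return_counts=True))
```

With the digits data loaded as above, calling subsample_per_class(digits.data, digits.target, 50) should give arrays of shape (500, 64) and (500,), provided every class has at least 50 samples; otherwise the helper raises a ValueError.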

CodePudding user response：

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, _, y, _ = train_test_split(
    digits.data, digits.target,
    stratify=digits.target, train_size=500
)
print(X.shape, y.shape)  # (500, 64) (500,)