For this demonstration, we'll use the digits dataset from sklearn (load_digits). It consists of 10 classes, one for each digit from 0 to 9.
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)
print(digits.target.shape)
The output looks like this:
(1797, 64)
(1797,)
So each digit class has a number of samples. I would like to draw a subsample of each class from the dataset. For example, for each of the digits 0 through 9, I need 50 samples of that class.
print(digits.data.shape)
print(digits.target.shape)
The result should be (50 samples * 10 classes = 500 samples):
(500, 64)
(500,)
The result should contain a subsample of every class available in the dataset.
CodePudding user response:
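One option is to use sklearn.model_selection.train_test_split with stratify set to the labels (here: digits.target) to split the data in a stratified fashion. Note that a stratified split preserves the class proportions of the full dataset, so each class ends up with roughly, not exactly, 50 samples.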
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    stratify=digits.target,
                                                    train_size=500)
# check how many samples of each label ended up in the subsample
print(np.unique(y_train, return_counts=True))
# (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
# array([50, 51, 49, 51, 50, 51, 50, 50, 48, 50]))
print(X_train.shape)
# (500, 64)
print(y_train.shape)
# (500,)
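As the printed counts show, the stratified split gives about 50 samples per class, not exactly 50. If exactly 50 per class is required, one possible approach is to sample the row indices of each class separately with NumPy. This is only a sketch: sample_per_class and its seed argument are hypothetical names introduced here, not part of sklearn.

```python
import numpy as np
from sklearn.datasets import load_digits


def sample_per_class(X, y, n_per_class, seed=0):
    """Return exactly n_per_class randomly chosen rows for every label in y.

    Hypothetical helper for illustration; not an sklearn function.
    """
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)  # row indices belonging to this class
        if len(idx) < n_per_class:
            raise ValueError(f"class {label} has only {len(idx)} samples")
        # draw without replacement so no row is selected twice
        picked.append(rng.choice(idx, size=n_per_class, replace=False))
    picked = np.concatenate(picked)
    rng.shuffle(picked)  # avoid returning the rows grouped by class
    return X[picked], y[picked]


digits = load_digits()
X_sub, y_sub = sample_per_class(digits.data, digits.target, n_per_class=50)

print(X_sub.shape)  # (500, 64)
print(y_sub.shape)  # (500,)
print(np.unique(y_sub, return_counts=True)[1])  # 50 for every class
```

The helper raises a ValueError if a class has fewer rows than requested, instead of silently returning a smaller subsample.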
CodePudding user response:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
X, _, y, _ = train_test_split(
    digits.data, digits.target,
    stratify=digits.target, train_size=500
)
print(X.shape, y.shape)  # (500, 64) (500,)
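Both answers draw a different random subsample on every run. If the same 500 rows are needed each time, train_test_split accepts a random_state argument. A minimal sketch, assuming a fixed seed is acceptable for your use case (stratified_subset is a hypothetical wrapper name, and 42 is an arbitrary seed):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()


def stratified_subset(seed):
    """Hypothetical wrapper: stratified 500-row subsample with a fixed seed."""
    X, _, y, _ = train_test_split(
        digits.data, digits.target,
        stratify=digits.target, train_size=500,
        random_state=seed,  # fixes the shuffling so the split is repeatable
    )
    return X, y


X_a, y_a = stratified_subset(42)
X_b, y_b = stratified_subset(42)

# the same seed yields the same rows in the same order
print(np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b))  # True
```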