I have an array (pd.Series) of two values (A's and B's, for example).
y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
0 A
1 B
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 B
10 B
I want to get a random sample of indices from the series, where half of the indices correspond to an A and the other half correspond to a B.
For example
get_random_stratified_sample_of_indices(y=y, n=4)
[0, 1, 2, 4]
Indices 0 and 2 correspond to A's, and indices 1 and 4 correspond to B's.
Another example
get_random_stratified_sample_of_indices(y=y, n=6)
[1, 4, 5, 0, 2, 3]
The order of the returned list of indices doesn't matter, but it needs to be evenly split between indices of A's and B's from the y array.
My plan was to first look at the indices of the A's, take a random sample (size=n/2) of them, and then repeat for B.
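A minimal sketch of that plan, assuming NumPy is available. The helper name sample_indices_per_label and its seed parameter are hypothetical and only there for illustration; it draws n divided by the number of labels indices for each label, without replacement.

```python
import numpy as np
import pandas as pd


def sample_indices_per_label(y, n, seed=None):
    """Draw n // (number of labels) random index values for each label in y."""
    rng = np.random.default_rng(seed)
    labels = y.unique()
    per_label = n // len(labels)
    picked = []
    for label in labels:
        # Index values whose entry in y equals the current label.
        candidates = y.index[y == label].to_numpy()
        # replace=False so that no index is drawn twice.
        picked.extend(rng.choice(candidates, size=per_label, replace=False).tolist())
    return picked


y = pd.Series(['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'B'])
idx = sample_indices_per_label(y, n=4, seed=0)
print(idx)
print(y[idx].value_counts())
```

If n is not divisible by the number of labels, this sketch returns fewer than n indices, and it raises an error when a label has fewer entries than the requested per-label size.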
CodePudding user response:
You can use groupby.sample:
N = 4
idx = (y
    .index.to_series()
    .groupby(y)
    .sample(n=N//len(y.unique()))
    .to_list()
)
Output: [3, 8, 10, 1]
Check:
3 A
8 A
10 B
1 B
dtype: object
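The same idea could be wrapped in a small function, sketched here under the assumption that a reproducible draw is wanted. The function name stratified_index_sample is hypothetical; random_state is an existing parameter of groupby.sample.

```python
import pandas as pd


def stratified_index_sample(y, n, random_state=None):
    """Return n // k index values per class, where k is the number of classes."""
    per_class = n // y.nunique()
    return (
        y.index.to_series()
        .groupby(y)
        .sample(n=per_class, random_state=random_state)
        .to_list()
    )


y = pd.Series(['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'B'])
first = stratified_index_sample(y, n=6, random_state=42)
second = stratified_index_sample(y, n=6, random_state=42)
print(first)
print(first == second)  # same seed, same draw
```

Because sampling is done without replacement, the call fails if the per-class size exceeds the number of rows in the smallest class.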
CodePudding user response:
Here's one way to do it:
def get_random_stratified_sample_of_indices(s, n):
    mask = s == 'A'
    s1 = s[mask]
    s2 = s[~mask]
    # Half of the sample comes from the A's; for odd n the B's get the extra one.
    m1 = n // 2
    m2 = m1 if n % 2 == 0 else m1 + 1
    i1 = s1.sample(m1).index.to_list()
    i2 = s2.sample(m2).index.to_list()
    return i1 + i2
Which could be used in this way:
y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
i = get_random_stratified_sample_of_indices(y, 5)
print(i)
print()
print(y[i])
Result:
[6, 2, 7, 10, 5]
6 A
2 A
7 B
10 B
5 B
CodePudding user response:
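I think you could use train_test_split from scikit-learn with its stratify parameter. Note that stratify keeps the class proportions of the input in each part, so the split is only as even as the classes themselves.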
from sklearn.model_selection import train_test_split
import pandas as pd
y = (
    pd.Series(["A", "B", "A", "A", "B", "B", "A", "B", "A", "B", "B"])
    .T.to_frame("col")
    .assign(i=lambda xdf: xdf.index)
)
print(y)
# Prints:
#
# col i
# 0 A 0
# 1 B 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 A 6
# 7 B 7
# 8 A 8
# 9 B 9
# 10 B 10
print('\n')
# ===== Actual solution =====================================
a, b = train_test_split(y, test_size=0.5, stratify=y["col"])
# ===========================================================
print(a)
# Prints:
#
# col i
# 10 B 10
# 6 A 6
# 7 B 7
# 8 A 8
# 4 B 4
print('\n')
print(b)
# Prints:
#
# col i
# 3 A 3
# 9 B 9
# 2 A 2
# 1 B 1
# 5 B 5
# 0 A 0
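To get a list of indices of a given size out of this approach, one option is to split the index values directly and pass the number of indices as train_size. This is only a sketch: the names sample_idx and rest_idx are illustrative, and the result follows the class proportions of y rather than guaranteeing an exact 50/50 split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series(["A", "B", "A", "A", "B", "B", "A", "B", "A", "B", "B"])

# Split the index values themselves; train_size is the number of indices wanted.
sample_idx, rest_idx = train_test_split(
    y.index.to_numpy(), train_size=4, stratify=y, random_state=0
)
print(sorted(sample_idx.tolist()))
print(y[sample_idx].value_counts())
```

When the classes are strongly imbalanced, the groupby-based approach is the safer choice for an even split.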