How to get a stratified random sample of indices?

I have an array (pd.Series) of two values (A's and B's, for example).

y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])


0 A
1 B
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 B
10 B

I want to get a random sample of indices from the series, where half of the indices correspond to an A and the other half correspond to a B.

For example

get_random_stratified_sample_of_indices(y=y, n=4)

[0, 1, 2, 4]

Indices 0 and 2 correspond to A's, and indices 1 and 4 correspond to B's.

Another example

get_random_stratified_sample_of_indices(y=y, n=6)

[1, 4, 5, 0, 2, 3]

The order of the returned list of indices doesn't matter, but it must be evenly split between indices of A's and B's from the y array.

My plan was to first collect the indices of the A's, take a random sample of size n/2 from them, and then repeat for the B's.
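For reference, that plan can be sketched directly with NumPy. This is only a minimal sketch: the function name sample_balanced_indices and the seed parameter are hypothetical, it assumes y holds exactly two classes, and it assumes each class has at least n // 2 members.

```python
import numpy as np
import pandas as pd

def sample_balanced_indices(y, n, seed=None):
    """Return n // 2 index labels per class (assumes two classes in y)."""
    rng = np.random.default_rng(seed)
    per_class = n // 2
    picked = []
    for label in y.unique():
        # Index labels of the rows that hold this class.
        candidates = y.index[y == label]
        # Sample without replacement; raises ValueError if the class
        # has fewer than per_class members.
        chosen = rng.choice(candidates, size=per_class, replace=False)
        picked.extend(chosen.tolist())
    return picked

y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
idx = sample_balanced_indices(y, n=4, seed=0)
print(idx)
print(y[idx].value_counts())
```

Passing a seed makes the draw repeatable; leaving it as None gives a different sample on each call.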

CodePudding user response：

You can use groupby.sample:

N = 4

idx = (y
  .index.to_series()
  .groupby(y)
  .sample(n=N//len(y.unique()))
  .to_list()
 )

Output: [3, 8, 10, 1]

Check:

3     A
8     A
10    B
1     B
dtype: object
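If it helps, the same groupby.sample idea can be wrapped in a function with the signature from the question. This is a sketch, not a definitive implementation: the random_state argument is an addition for reproducibility, and the function assumes n is divisible by the number of classes and that every class has enough members.

```python
import pandas as pd

def get_random_stratified_sample_of_indices(y, n, random_state=None):
    # Equal share per class; any remainder of n is dropped by the floor division.
    per_class = n // y.nunique()
    return (
        y.index.to_series()
        .groupby(y)  # group the index labels by the class stored in y
        .sample(n=per_class, random_state=random_state)
        .to_list()
    )

y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
idx = get_random_stratified_sample_of_indices(y, n=6, random_state=0)
print(idx)
print(y[idx].value_counts())
```

Because it divides by y.nunique(), this version also works unchanged for more than two classes.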

CodePudding user response：

Here's one way to do it:

def get_random_stratified_sample_of_indices(s, n):
    mask = s == 'A'
    s1 = s[mask]   # rows holding 'A'
    s2 = s[~mask]  # all other rows ('B')
    m1 = n // 2
    # For odd n, the extra draw goes to the second group.
    m2 = m1 if n % 2 == 0 else m1 + 1
    i1 = s1.sample(m1).index.to_list()
    i2 = s2.sample(m2).index.to_list()
    return i1 + i2

Which could be used in this way:

import pandas as pd

y = pd.Series(['A','B','A','A','B','B','A','B','A','B','B'])
i = get_random_stratified_sample_of_indices(y, 5)
print(i)
print()
print(y[i])

Result:

[6, 2, 7, 10, 5]

6     A
2     A
7     B
10    B
5     B

CodePudding user response：

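You could use train_test_split from scikit-learn with its stratify parameter. Note that stratify keeps the class proportions of the original data in each part rather than forcing an exact 50/50 split, so the result is only as balanced as y itself.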


from sklearn.model_selection import train_test_split
import pandas as pd

y = (
    pd.Series(["A", "B", "A", "A", "B", "B", "A", "B", "A", "B", "B"])
    .to_frame("col")
    .assign(i=lambda xdf: xdf.index)
)
print(y)
# Prints:
#
#    col   i
# 0    A   0
# 1    B   1
# 2    A   2
# 3    A   3
# 4    B   4
# 5    B   5
# 6    A   6
# 7    B   7
# 8    A   8
# 9    B   9
# 10   B  10
print('\n')

# ===== Actual solution =====================================
a, b = train_test_split(y, test_size=0.5, stratify=y["col"])
# ===========================================================
print(a)
# Prints:
#
#    col   i
# 10   B  10
# 6    A   6
# 7    B   7
# 8    A   8
# 4    B   4

print('\n')
print(b)
# Prints:
#
#   col  i
# 3   A  3
# 9   B  9
# 2   A  2
# 1   B  1
# 5   B  5
# 0   A  0
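
Since the question asks for a list of indices rather than a DataFrame, one possible variation is to split the index labels directly and ask for a fixed number of them with train_size. This is a sketch under a few assumptions: train_size=4 and random_state=0 are illustrative choices, and the unused remainder is simply discarded.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series(["A", "B", "A", "A", "B", "B", "A", "B", "A", "B", "B"])

# Split the index labels themselves; stratify=y keeps the A/B proportions
# of y in both parts. random_state is only here for reproducibility.
sampled, _rest = train_test_split(
    y.index.to_numpy(), train_size=4, stratify=y, random_state=0
)
idx = sampled.tolist()
print(idx)
print(y[idx].value_counts())
```

If an exact even split is required regardless of how unbalanced y is, sampling a fixed number per class is the safer route.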