Sample selection with minimum class size


I would like to have at least 4 samples of each letter. However, the samples should be selected by id so that the ones with the most matches (the highest row-level sum) are used first.

I have already tried moving the sum to row level and then sampling from the highest to the lowest value. However, the minimum class size cannot be ensured that way. The remaining samples needed to fill a class can be chosen randomly.

Does anyone have an idea how I can solve this in pandas?


id  A   B   C   D   E
0   1   1   0   0   0
1   1   0   0   1   0
2   0   1   1   0   0
3   0   1   0   0   0
4   1   1   1   0   1
5   0   1   0   1   1
6   0   0   0   1   1
7   1   0   0   0   1
8   1   0   1   1   1
9   0   0   0   0   0
10  1   1   0   0   0
11  1   0   0   1   0
12  0   1   1   0   0
13  0   1   0   0   0
14  1   1   1   0   1
15  0   1   0   1   1
16  0   0   0   1   1
17  1   0   0   0   1
18  1   0   1   1   1
19  0   0   0   0   0
                            
sum 10  10  6   8   10
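
For reference, the example frame can be reconstructed like this (a minimal sketch; it assumes the id column becomes the index):

import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0],
    "B": [1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
    "C": [0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    "D": [0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0],
    "E": [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
})
df.index.name = "id"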

CodePudding user response:

This is what I understood from your question. The dataframe is df.

Create a matches column:

df = df.assign(matches=df.sum(axis=1))

Sort the rows that contain the letter by number of matches:

letter = "A"
ans = df[df[letter] == 1].sort_values(by="matches", ascending=False)

If you need a random sample of that:

n = 4
ans = ans.sample(n)

Finally, if the sample needs to be sorted by matches again:

ans = ans.sort_values(by="matches", ascending=False)

For letter == "B":

>>> ans
    A  B  C  D  E  matches
15  0  1  0  1  1        3
10  1  1  0  0  0        2
0   1  1  0  0  0        2
3   0  1  0  0  0        1
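
Putting the steps together into a small helper (a sketch; pick_for_letter is just an illustrative name, and it assumes the matches column was added as above and that every letter has at least n candidate rows):

def pick_for_letter(df, letter, n=4, randomize=False):
    # rows containing the letter, most matches first
    candidates = df[df[letter] == 1].sort_values(by="matches", ascending=False)
    # either the top-n rows or a random n of the candidates
    picked = candidates.sample(n) if randomize else candidates.head(n)
    # sort the picked rows by matches again
    return picked.sort_values(by="matches", ascending=False)

# at least 4 samples per letter
per_letter = {letter: pick_for_letter(df, letter) for letter in ["A", "B", "C", "D", "E"]}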

CodePudding user response:

If I understand correctly, you are trying to pick at least 4 samples of each letter, using the highest-ranked ids first.

One way is to rank each index by the number of columns it appears in. First, collect the indices with a 1 in each column, highest first -

high_index = []
for col in df.columns:
    high_index.append(sorted(df[df[col] > 0].index.values, reverse=True))

This gives you, for each column, its indices sorted from highest to lowest - e.g. for A, the highest index with a 1 is 18

# print(high_index)
[[18, 17, 14, 11, 10, 8, 7, 4, 1, 0],
 [15, 14, 13, 12, 10, 5, 4, 3, 2, 0],
 [18, 14, 12, 8, 4, 2],
 [18, 16, 15, 11, 8, 6, 5, 1],
 [18, 17, 16, 15, 14, 8, 7, 6, 5, 4]]

Then you can rank each of those indices by how many times they appear -

rank_index = [(x, sum(x in y for y in high_index)) for x in range(df.index.size)]
# print(rank_index)
[(0, 2),
 (1, 2),
 (2, 2),
 (3, 1),
 (4, 4),
 (5, 3),
 (6, 2),
 (7, 2),
 (8, 4),
 (9, 0),
 (10, 2),
 (11, 2),
 (12, 2),
 (13, 1),
 (14, 4),
 (15, 3),
 (16, 2),
 (17, 2),
 (18, 4),
 (19, 0)]

That is, index 0 appears with a 1 in 2 columns (you could get the same result by summing on axis=1 as well, as sketched below).
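
A sketch of that row sum (it assumes df holds only the letter columns and that the index runs 0 to 19 as in the example):

# same (index, count) pairs as rank_index, built from a row-wise sum
rank_index = list(enumerate(df.sum(axis=1)))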

If you re-order the df by these ranks and take a cumulative sum, the running minimum across the columns (min_samples) tells you how many samples of the rarest letter are covered so far; the shortest prefix where it reaches 4 is your sample

reordered_index = [x for x, y in sorted(rank_index, reverse=True, key=lambda x: x[1])]
df_2 = df.loc[reordered_index, :]
df_2 = df_2.cumsum()
df_2['min_samples'] = df_2.min(axis=1)
# print(df_2)
     A   B  C  D   E  min_samples
id                               
4    1   1  1  0   1            0
8    2   1  2  1   2            1
14   3   2  3  1   3            1
18   4   2  4  2   4            2
5    4   3  4  3   5            3
15   4   4  4  4   6            4
0    5   5  4  4   6            4
1    6   5  4  5   6            4
2    6   6  5  5   6            5
6    6   6  5  6   7            5
7    7   6  5  6   8            5
10   8   7  5  6   8            5
11   9   7  5  7   8            5
12   9   8  6  7   8            6
16   9   8  6  8   9            6
17  10   8  6  8  10            6
3   10   9  6  8  10            6
13  10  10  6  8  10            6
9   10  10  6  8  10            6
19  10  10  6  8  10            6

The smallest sample you can pick is the first 6 rows of df_2 (ids 4, 8, 14, 18, 5, 15) - that is where min_samples first reaches 4.
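
To pick that prefix programmatically (a sketch; min_class is the required minimum class size, and it assumes min_samples actually reaches that value):

min_class = 4
# position of the first row at which every letter has at least min_class samples
cut = int((df_2["min_samples"] >= min_class).to_numpy().argmax()) + 1
sample = df.loc[df_2.index[:cut]]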
