I have a pandas DataFrame called "df". Given a row count N (for example N = 100) and a column (here "Type"), I would like to sample N rows in total from the population so that every type occurs the same number of times.
SNo | Type | Difficulty |
---|---|---|
1 | Single | 5 |
2 | Single | 15 |
3 | Single | 4 |
4 | Multiple | 2 |
5 | Multiple | 14 |
6 | None | 7 |
7 | None | 4323 |
For instance, if I specify N = 3, the output should be:
SNo | Type | Difficulty |
---|---|---|
1 | Single | 5 |
4 | Multiple | 2 |
6 | None | 7 |
If N cannot be split evenly, or a type has too few rows to fill its share, the count of another type may be increased at random to make up the difference.
I am wondering how to approach this programmatically. Thanks!
CodePudding user response:
Use groupby.sample (pandas ≥ 1.1) with N divided by the number of types.
NB. This assumes that N is a multiple of the number of types if you want strict equality.
N = 3
N2 = N//df['Type'].nunique()
out = df.groupby('Type').sample(n=N2)
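To try this end to end, here is a minimal self-contained sketch that rebuilds the table from the question and checks the per-type counts. The random_state argument and the names n_types, per_type and counts are my own additions for illustration; random_state is there only to make the draw repeatable.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

N = 3
n_types = df["Type"].nunique()   # 3 distinct types
per_type = N // n_types          # rows to draw from each type

# Draw the same number of rows from every type.
out = df.groupby("Type").sample(n=per_type, random_state=0)

# Count how many rows each type contributed.
counts = out["Type"].value_counts()
print(out)
print(counts.to_dict())
```

Each type should show up exactly once in counts; which row is picked within a type depends on the seed.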
Handling an N that is not a multiple of the number of types
Proceed as above, then top up to N with random rows, excluding those already selected.
N = 5
N2, R = divmod(N, df['Type'].nunique())
out = df.groupby('Type').sample(n=N2)
out = pd.concat([out, df.drop(out.index).sample(n=R)])
This can still pick several of the extra rows from the same group. If you really want the extra rows to come from different groups, replace the last step with:
out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])
Example output:
   SNo      Type  Difficulty
4    5  Multiple          14
6    7      None        4323
2    3    Single           4
3    4  Multiple           2
5    6      None           7
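One case the snippets above do not cover is a type that has fewer rows than its share: as far as I know, groupby.sample then raises a ValueError unless you pass replace=True. Below is a rough sketch of one way around it, not the only one. The helper sample_balanced and its parameters are hypothetical names of mine, and it assumes the index of df is unique. It swaps groupby.sample for a shuffle followed by groupby.head, so a small group simply contributes all of its rows, and then tops up to n from the rows not yet picked.

```python
import pandas as pd


def sample_balanced(df, column, n, random_state=None):
    """Draw n rows with per-group counts as equal as the data allows.

    Hypothetical helper; assumes df has a unique index.
    """
    if n > len(df):
        raise ValueError("n is larger than the number of rows")
    quota = n // df[column].nunique()
    # Shuffle once, then keep at most `quota` rows per group. A group with
    # fewer rows than the quota contributes everything it has.
    shuffled = df.sample(frac=1, random_state=random_state)
    out = shuffled.groupby(column).head(quota)
    # Top up to n with random rows that have not been picked yet.
    missing = n - len(out)
    if missing:
        extra = df.drop(out.index).sample(n=missing, random_state=random_state)
        out = pd.concat([out, extra])
    return out


# Sample data copied from the question.
df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

# Made-up variation: drop the last row so that "None" has a single row.
demo = df.iloc[:6]
res = sample_balanced(demo, "Type", n=6, random_state=0)
print(res["Type"].value_counts().to_dict())
```

With demo and n=6 the share is 2 rows per type, "None" can only give 1, and the missing row has to come from "Single" (the only type with a row left), so the counts end up as 3 Single, 2 Multiple, 1 None. Note that the top-up here is drawn from all remaining rows, so it does not force the extra rows to come from different groups.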