Pandas - Equal occurrences of unique type for a column

Time:07-29

I have a pandas DataFrame called "DF". Given a row count N (for example N = 100) and a column, "Type", I would like to sample N rows in total from the population so that each type occurs the same number of times.

SNo Type Difficulty
1 Single 5
2 Single 15
3 Single 4
4 Multiple 2
5 Multiple 14
6 None 7
7 None 4323

For instance, if I specify N = 3, the output could be:

SNo Type Difficulty
1 Single 5
4 Multiple 2
6 None 7

If, for a given N, some types do not have enough rows to fill their equal share, the count of another type can be increased at random to make up the difference.

I am wondering how to approach this programmatically. Thanks!

CodePudding user response:

Use groupby.sample (pandas ≥ 1.1) with N divided by the number of types.

NB. Strict equality requires N to be a multiple of the number of types. Otherwise the integer division rounds down and fewer than N rows are returned.

N = 3
N2 = N//df['Type'].nunique()

out = df.groupby('Type').sample(n=N2)
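For reference, here is a minimal self-contained sketch of the snippet above, run against the sample data from the question. The DataFrame construction and the `per_type` name are only for illustration, and `random_state` is fixed purely to make the draw reproducible.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

N = 3
per_type = N // df["Type"].nunique()  # 3 types, so 1 row per type

# random_state is optional; it is set here only for reproducibility.
out = df.groupby("Type").sample(n=per_type, random_state=0)
print(out)
```

Each type contributes exactly `per_type` rows, so the result has one row per type here.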

Handling an N that is not a multiple of the number of types

Sample as above, then top up to N with random rows that were not already selected.

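import pandas as pd

N = 5
# N2 rows per type, R rows left over to reach N
N2, R = divmod(N, df['Type'].nunique())

out = df.groupby('Type').sample(n=N2)

# top up with R random rows that were not already selected
out = pd.concat([out, df.drop(out.index).sample(n=R)])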

The extra rows can still come from the same group. If you really want to ensure that they come from different groups, replace the last step with:

out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])
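The two steps can be wrapped in one function. This is only a sketch: `sample_balanced` and its parameters are hypothetical names, not pandas API. It assumes a unique index (so that `drop` removes exactly the selected rows), at least `n // k` rows in each of the `k` groups, and enough groups with unused rows to supply the remainder.

```python
import pandas as pd

def sample_balanced(df, column, n, random_state=None):
    """Hypothetical helper: draw n rows so that group counts differ by at most one."""
    base, extra = divmod(n, df[column].nunique())
    # Equal share for every group.
    out = df.groupby(column).sample(n=base, random_state=random_state)
    if extra:
        # One unused candidate per group, then keep `extra` of them,
        # so the additional rows all come from different groups.
        candidates = (
            df.drop(out.index)
              .groupby(column)
              .sample(n=1, random_state=random_state)
        )
        out = pd.concat([out, candidates.sample(n=extra, random_state=random_state)])
    return out

# Sample data copied from the question.
df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

result = sample_balanced(df, "Type", 5, random_state=0)
print(result)
```

With N = 5 and three types, every type gets one row and two different types get a second one.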

Example output:

   SNo      Type  Difficulty
4    5  Multiple          14
6    7      None        4323
2    3    Single           4
3    4  Multiple           2
5    6      None           7
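The snippets above assume every type has at least N2 rows; `groupby.sample` raises an error when a group is smaller than `n` and `replace` is left at its default of False. For the case in the question where a type cannot fill its share, one possible sketch is to shuffle once, cap each group with `head`, and fill the shortfall from the remaining rows. `sample_capped` is a hypothetical name, and the sketch assumes a unique index. The `small` frame drops one row from the question's data only to create an undersized group.

```python
import pandas as pd

def sample_capped(df, column, n, random_state=None):
    """Hypothetical helper: up to n // k rows per group, then a random top-up."""
    quota = n // df[column].nunique()
    # Shuffle once; the first `quota` rows of each group are then a random draw.
    # head() simply returns fewer rows for groups smaller than the quota.
    shuffled = df.sample(frac=1, random_state=random_state)
    out = shuffled.groupby(column).head(quota)
    # Fill the shortfall from rows not yet selected, whatever their type.
    shortfall = min(n - len(out), len(df) - len(out))
    if shortfall > 0:
        out = pd.concat([out, shuffled.drop(out.index).head(shortfall)])
    return out

# Sample data copied from the question.
df = pd.DataFrame({
    "SNo": [1, 2, 3, 4, 5, 6, 7],
    "Type": ["Single", "Single", "Single", "Multiple", "Multiple", "None", "None"],
    "Difficulty": [5, 15, 4, 2, 14, 7, 4323],
})

# Illustration only: drop the last row so that "None" has a single row left.
small = df.drop(index=6)
capped = sample_capped(small, "Type", 6, random_state=0)
print(capped["Type"].value_counts())
```

Here the quota is 2 per type, "None" can only supply 1 row, and the missing row is taken from the only rows still unused, which belong to "Single".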