Pandas groupby and sample evenly

Time:04-09

This question is similar to this one here, but applied to pandas

import pandas as pd

df = pd.DataFrame({'tid': [0]*44 + [2]*66, 'fidx': list(range(44)) + list(range(66))})

I need to sample 10 'fidx' values per 'tid' such that the sampled values are as far apart as possible. I figured out how to do it with the function below; however, I think this can be done with df.groupby and some other functions, but I can't seem to figure out how.

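from itertools import chain

def sampling(df):
    # first and last row index of each 'tid' block (assumes each tid's rows are contiguous)
    mins = df['tid'].drop_duplicates().index
    maxes = df['tid'].drop_duplicates(keep='last').index
    frames = []
    for mi, ma in zip(mins, maxes):
        # 10 roughly evenly spaced row positions between mi and ma
        frames.append([mi + int(x*(ma-mi)/10) for x in range(10)])

    # flatten the per-tid lists into one list of row positions
    frames = list(chain(*frames))
    return frames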

The worst part is having to flatten the list at the end.

Expected output of df.iloc[frames, :], where frames is the list returned by sampling(df):

     tid  fidx
0      0     0
4      0     4
8      0     8
12     0    12
17     0    17
21     0    21
25     0    25
30     0    30
34     0    34
38     0    38
44     2     1
50     2    14
57     2    21
63     2    27
70     2    34
76     2    40
83     2    47
89     2    53
96     2    60
102    2    66

That is, 10 fidx values for each tid, with the fidx values as evenly separated as possible.

Answer:

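Here's one possible way. groupby with cumcount numbers the rows within each group, starting at 0. The group size (groupby with transform('count')) divided by 10 and rounded gives the spacing between sampled rows. We then keep the rows whose number modulo the spacing equals 0. Because the spacing is rounded, a group can end up with slightly more or fewer than 10 rows (tid 0 gets 11 in the output shown):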

g = df.groupby(['tid'])['fidx']
out = df[g.cumcount().mod(g.transform('count').div(10).round(0)) == 0]

Output:

     tid  fidx
0      0     0
4      0     4
8      0     8
12     0    12
16     0    16
20     0    20
24     0    24
28     0    28
32     0    32
36     0    36
40     0    40
44     2     0
51     2     7
58     2    14
65     2    21
72     2    28
79     2    35
86     2    42
93     2    49
100    2    56
107    2    63
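
If exactly 10 rows per tid are required, the question's spacing formula can be applied per group instead of a rounded step. The following is a minimal sketch, not the answerer's method: sample_evenly and its k parameter are hypothetical names, and it relies on GroupBy.indices, which maps each group key to the integer positions of that group's rows.

```python
import numpy as np
import pandas as pd

def sample_evenly(df, key, k=10):
    """Pick k roughly evenly spaced rows from each group of df[key].

    Hypothetical helper; uses the question's spacing x*(n-1)/k, floored.
    """
    picks = []
    # GroupBy.indices maps each group key to the integer positions of its rows
    for rows in df.groupby(key).indices.values():
        offsets = np.arange(k) * (len(rows) - 1) // k
        picks.append(rows[offsets])
    # np.unique sorts the positions and drops repeats
    # (repeats occur when a group has fewer than k rows)
    return df.iloc[np.unique(np.concatenate(picks))]

df = pd.DataFrame({'tid': [0]*44 + [2]*66,
                   'fidx': list(range(44)) + list(range(66))})
out = sample_evenly(df, 'tid')
```

On the example frame this selects the same row labels as the question's sampling function (0, 4, 8, ..., 96, 102), with 10 rows for each tid. The per-group arrays still have to be concatenated, but the min/max bookkeeping is left to groupby.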