This question is similar to this one here, but applied to pandas
import pandas as pd
df = pd.DataFrame({'tid': [0]*44 + [2]*66, 'fidx': list(range(44)) + list(range(66))})
I need to sample 10 'fidx' per 'tid' such that the sampled fidx are as far apart as possible.
I figured out how to do it like this; however, I think this can be done with df.groupby
and some other functions, but I can't seem to figure out how.
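from itertools import chain

def sampling(df):
    # first and last index label of each tid block (assumes each tid is contiguous)
    mins = df['tid'].drop_duplicates().index
    maxes = df['tid'].drop_duplicates(keep='last').index
    frames = []
    for mi, ma in zip(mins, maxes):
        # 10 evenly spaced index values starting at mi
        frames.append([mi + int(x*(ma-mi)/10) for x in range(10)])
    # flatten the per-tid lists into one list of index values
    frames = list(chain(*frames))
    return frames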
The worst part is having to flatten the list at the end.
Expected output:
df.iloc[sampling(df), :]
tid fidx
0 0 0
4 0 4
8 0 8
12 0 12
17 0 17
21 0 21
25 0 25
30 0 30
34 0 34
38 0 38
44 2 0
50 2 6
57 2 13
63 2 19
70 2 26
76 2 32
83 2 39
89 2 45
96 2 52
102 2 58
That is, 10 fidx for each tid, with the fidx as evenly separated as possible.
CodePudding user response:
Here's one possible way. groupby + cumcount numbers the fidx values within each group. Then groupby + count, divided by 10 and rounded, gives the spacing between the rows to keep. Finally, we select the rows whose number modulo that spacing is 0:
g = df.groupby(['tid'])['fidx']
out = df[g.cumcount().mod(g.transform('count').div(10).round(0)) == 0]
Output:
tid fidx
0 0 0
4 0 4
8 0 8
12 0 12
16 0 16
20 0 20
24 0 24
28 0 28
32 0 32
36 0 36
40 0 40
44 2 0
51 2 7
58 2 14
65 2 21
72 2 28
79 2 35
86 2 42
93 2 49
100 2 56
107 2 63
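Note that the rounded spacing does not always give exactly 10 rows: for tid 0 the spacing is round(44/10) = 4, so 11 rows are kept, as the output above shows. If exactly 10 rows per group matter, one alternative is to compute evenly spaced positions inside each group with np.linspace. The following is only a minimal sketch under that assumption; sample_evenly and its key and n parameters are hypothetical names chosen for illustration, not part of pandas. Unlike the function in the question, it includes the last row of each group.

```python
import numpy as np
import pandas as pd


def sample_evenly(df, key="tid", n=10):
    """Pick up to n rows per group, evenly spaced from first to last row.

    Hypothetical helper for illustration; not a pandas API.
    """
    picked = []
    # .indices maps each group label to the integer positions of its rows
    for positions in df.groupby(key).indices.values():
        k = min(n, len(positions))
        # k evenly spaced offsets into this group, both endpoints included
        offsets = np.linspace(0, len(positions) - 1, num=k).round().astype(int)
        picked.append(positions[offsets])
    # restore the original row order before selecting
    return df.iloc[np.sort(np.concatenate(picked))]


df = pd.DataFrame({
    "tid": [0] * 44 + [2] * 66,
    "fidx": list(range(44)) + list(range(66)),
})
out = sample_evenly(df)
print(out.groupby("tid").size())
```

This keeps a short Python loop over the groups, but there is nothing to flatten by hand: np.concatenate joins the per-group positions, and a single df.iloc call does the selection.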