I want to parallelize some filtering/sorting steps after a groupby.
import time
import pandas as pd

groups = df.groupby(['date'])
print('here, %s' % type(groups))

new_df = pd.DataFrame()
tt = time.time()
for name, group in groups:
    # filter out rows whose overlapping area is below 80
    group = group[group['intersection_area'] >= 80]
    # keep scenes whose clear percent is at least 10 %
    group = group[group['clear_percent'] >= 10]
    # sort by clear percent (sort_values returns a new frame, so assign it back)
    group = group.sort_values(['clear_percent'])
    if not group.empty:
        # keep only the first scene of the group
        group = group.head(1)
        new_df = new_df.append(group)

ids = new_df['scene_id'].to_list()
How can I parallelize the code inside the for loop? I have reviewed several answers on Stack Overflow, but unfortunately most of them deal with apply, sum, or mean, which I am not using.
CodePudding user response:
Filter and sort first, then take the head of each group. It's worth noting that sort_values defaults to ascending order, so if you actually want the highest clear_percent you'll need to add ascending=False to your sort (see the variant after the output below).
import pandas as pd

df = pd.DataFrame({'date': [1, 1, 2, 3],
                   'intersection_area': [90, 70, 50, 85],
                   'clear_percent': [15, 25, 7, 18],
                   'scene_id': [100, 101, 102, 103]})

ids = (
    df.loc[(df['intersection_area'].ge(80)) & (df['clear_percent'].ge(10))]
    .sort_values(by='clear_percent')
    .groupby('date')
    .head(1)
    .scene_id
    .to_list()
)
print(ids)
Output:
[100, 103]
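As a sketch building on the note about sort order above (using the same sample df, and assuming you want the scene with the highest clear_percent per date rather than the lowest), the only change needed is ascending=False in sort_values:

# keep the scene with the highest clear_percent per date instead of the lowest
ids_highest = (
    df.loc[(df['intersection_area'].ge(80)) & (df['clear_percent'].ge(10))]
    .sort_values(by='clear_percent', ascending=False)
    .groupby('date')
    .head(1)
    .scene_id
    .to_list()
)
print(ids_highest)

With this particular sample data the result is still [100, 103], because each surviving date has only one qualifying row, but on real data with several qualifying scenes per date the two sort directions pick different rows.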