I want to parallelize some filtering/sorting steps after a groupby.
import time
import pandas as pd

groups = df.groupby(['date'])
print('here, %s' % type(groups))

new_df = pd.DataFrame()
tt = time.time()
for name, group in groups:
    # filter out rows whose overlapping area is below 80
    group = group[group['intersection_area'] >= 80]
    # keep scenes whose clear percent is at least 10 %
    group = group[group['clear_percent'] >= 10]
    # sort by clear percent (sort_values returns a new frame, so assign it back)
    group = group.sort_values(['clear_percent'])
    if not group.empty:
        # keep only the first scene of the group
        group = group.head(1)
        new_df = new_df.append(group)

ids = new_df['scene_id'].to_list()
How can I parallelize the code inside the for loop? I have reviewed several answers on Stack Overflow, but unfortunately most of them deal with apply, sum, or mean, which I am not using.
CodePudding user response:
Filter and sort first, then take the head of each group. It's worth noting that sort_values defaults to ascending order, so if you actually want the highest clear_percent you'll need to add ascending=False to your sort (see the variant after the output below).
import pandas as pd

df = pd.DataFrame({'date': [1, 1, 2, 3],
                   'intersection_area': [90, 70, 50, 85],
                   'clear_percent': [15, 25, 7, 18],
                   'scene_id': [100, 101, 102, 103]})

ids = (
    df.loc[(df['intersection_area'].ge(80)) & (df['clear_percent'].ge(10))]
    .sort_values(by='clear_percent')
    .groupby('date')
    .head(1)
    .scene_id
    .to_list()
)
print(ids)
Output:
[100, 103]
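As a sketch building on the note about sort order above (using the same sample df, and assuming you want the scene with the highest clear_percent per date rather than the lowest), the only change needed is ascending=False in sort_values:

# keep the scene with the highest clear_percent per date instead of the lowest
ids_highest = (
    df.loc[(df['intersection_area'].ge(80)) & (df['clear_percent'].ge(10))]
    .sort_values(by='clear_percent', ascending=False)
    .groupby('date')
    .head(1)
    .scene_id
    .to_list()
)
print(ids_highest)

With this particular sample data the result is still [100, 103], because each surviving date has only one qualifying row, but on real data with several qualifying scenes per date the two sort directions pick different rows.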