How to optimize grouping of a DataFrame and performing operations on the groups


Here's an example of my dataframe:

d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'],
     'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1],
     'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)
df
    group   round   score
0   a          3    0.3
1   a          3    0.1
2   a          2    0.6
3   b          1    0.8
4   b          3    0.2
5   b          1    0.5
6   b          3    0.5
7   b          3    0.6
8   c          3    0.4
9   d          2    0.9
10  d          1    0.1

My actual dataframe has 6 columns and > 1,000,000 rows. I'm trying to figure out the fastest way to do the following:

For each group, find the average score for each of the 3 rounds and apply a calculation to it (here, 10 * mean + 1). If a group has no scores for a round, write 'NA'.

I'm not sure whether it would be faster to build a list of lists and then convert it into a dataframe, or to make a new dataframe and populate that, so I went with the list first:

def test_df(data):
    value_counts = data['group'].value_counts().to_dict()
    avgs = []

    for key, val in value_counts.items():
        row = data[data['group'] == key]
        
        x = [key]
        
        if val < 2:
        x.extend([10 * row['score'].values[0] + 1 if i == row['round'].values[0] else 'NA' for i in range(1, 4)])
        
        else:
        x.extend([(10 * row[row['round'] == i]['score'].mean() + 1) if len(row[row['round'] == i]) > 0 else 'NA' for i in range(1, 4)])
            
        avgs.append(x)     
            
    return avgs

Here I created a separate case because about 80% of the groups in my data have only one row, so I figured handling them without a groupwise mean might speed things up.

This returns the correct results in the format [group, round 1, round 2, round 3]:

[['b', 7.5, 'NA', 5.333333333333333],
 ['a', 'NA', 7.0, 3.0],
 ['d', 2.0, 10.0, 'NA'],
 ['c', 'NA', 'NA', 5.0]]

but it looks like it's going to take a very long time on the actual dataframe... Does anyone have any better ideas?

CodePudding user response:

It looks to me like you're basically doing a groupby/mean and a pivot.

import pandas as pd
d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'],
     'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1],
     'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)

df = (df.groupby(['group', 'round'])['score'].mean() * 10 + 1).reset_index()
df.pivot_table(index='group', columns='round', values='score', fill_value='NA').reset_index().values

Output

array([['a', 'NA', 7.0, 3.0],
       ['b', 7.5, 'NA', 5.333333333333333],
       ['c', 'NA', 'NA', 5.0],
       ['d', 2.0, 10.0, 'NA']], dtype=object)
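
The same result can also be produced in a single pivot_table call, since pivot_table aggregates with mean by default. A minimal sketch of that variant (the out variable is just illustrative), starting from the original df before the reassignment above:

out = df.pivot_table(index='group', columns='round', values='score', aggfunc='mean')
out = (out * 10 + 1).fillna('NA').reset_index().values

This skips the intermediate reset_index and keeps the arithmetic vectorized over the whole pivoted table.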

CodePudding user response:

An imbalanced dataset may show different results, but I tested with the script below and found that even with the pandas dataframe, performance looks okay. However, you can always compare it against a native Python data structure.

import random
import datetime
import pandas as pd

def generate_data():  # generate synthetic sample data
    data = {'group': [], 'round': [], 'score': []}
    for index in range(10 ** 6):  # sample size
        data['group'].append(random.choice(['a', 'b', 'c', 'd']))
        data['round'].append(random.randrange(1, 4))
        data['score'].append(round(random.random(), 1))
    return data

def calc_with_native_ds(data):  # native python data structure (stub; see the sketch below)
    pass

def calc_with_pandas_df(df):  # pandas dataframe
    return df.groupby(['group', 'round']).mean()

if __name__ == '__main__':

    data = generate_data()
    df = pd.DataFrame(data)
    print(df.shape)

    start_datetime = datetime.datetime.now()

    # calc_with_native_ds(data)
    calc_with_pandas_df(df)

    end_datetime = datetime.datetime.now()
    elapsed_time = round((end_datetime - start_datetime).total_seconds(), 5)
    print(f"elapsed_time: {elapsed_time}")
