I have a dataframe which I have grouped based on a column. I then have to merge these "grouped dataframes" into newer dataframes, under the condition that a new dataframe must not have more than x rows (3 in this case); if adding a group would exceed that count, I start a new df (the else part in my code). I have code that does this, but it is slow on my actual dataset of 300,000 rows.
Code
import pandas as pd
a = [1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3]
df = pd.DataFrame({"a": a})
dfs = [pd.DataFrame()]
group = (df['a'] != df['a'].shift()).cumsum()
for i, j in df.groupby(group):
    curRow = j.shape[0]
    prevRow = dfs[-1].shape[0]
    # would adding this group push the last df past 3 rows?
    if curRow + prevRow <= 3:
        # still within the limit, so add to the previous df
        dfs[-1] = dfs[-1].append(j)
    else:
        # it would exceed 3, so start this as a new df
        dfs.append(j)
Expected Output
"dfs" will have 25 dataframes but faster than current code
Current Output
"dfs" will have 25 dataframes
For those asking about the logic of the group-by: it basically reproduces itertools.groupby, which groups a sequence into runs of consecutive equal values even when the sequence is not sorted; see "Make Pandas groupby act similarly to itertools groupby".
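For concreteness, a minimal illustration of that equivalence (toy data, not the frame above): itertools.groupby splits a sequence into runs of consecutive equal values, and the shift/cumsum trick produces one integer label per such run.
from itertools import groupby
import pandas as pd

a = [1, 1, 2, 3, 2, 2, 3]

# itertools.groupby forms runs of *consecutive* equal values
print([list(g) for _, g in groupby(a)])
# [[1, 1], [2], [3], [2, 2], [3]]

# the shift/cumsum trick assigns one label per run
s = pd.Series(a)
print((s != s.shift()).cumsum().tolist())
# [1, 1, 2, 3, 4, 4, 5]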
CodePudding user response:
You should never append to DataFrames in a loop; each append copies all of the data accumulated so far, which is very slow. That is most likely the bottleneck.
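To illustrate the cost with a toy sketch (made-up piece sizes, not the OP's data): growing a DataFrame step by step copies every accumulated row on each iteration, so the loop is quadratic, while collecting the pieces in a list and concatenating once is linear. (DataFrame.append itself was deprecated in pandas 1.4 and removed in 2.0.)
import pandas as pd

pieces = [pd.DataFrame({"a": range(3)}) for _ in range(1000)]

# quadratic: each step copies all rows accumulated so far
slow = pd.DataFrame()
for p in pieces:
    slow = pd.concat([slow, p])

# linear: collect the pieces first, concatenate once
fast = pd.concat(pieces)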
One way to avoid that is to iterate over the old group sizes and determine the new group label of each row according to the constraint, then pass those labels to DataFrame.groupby to form the new groups.
This method should be a lot faster for large DataFrames.
import pandas as pd

a = [1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3]
df = pd.DataFrame({"a": a})

group = (df['a'] != df['a'].shift()).cumsum()
old_group_sizes = df.groupby(group).size().tolist()

# new group labels to pass to groupby (one for each row)
new_labels = []
# initialize the new group label (an integer)
group_label = 0
# current group size
group_size = 0

# iterate over the old groups' sizes
for s in old_group_sizes:
    # increment the size of the current group
    group_size += s
    if group_size > 3:        # if it exceeds the threshold
        group_label += 1      # start a new group
        group_size = s        # and reset the new group's size
    # set the new group label for each row of the old group
    new_labels.extend([group_label] * s)

# create a list of the new groups / DataFrames
dfs = [g for k, g in df.groupby(new_labels)]
Output
# number of new groups
>>> len(dfs)
25
# first group
>>> dfs[0]
a
0 1
1 1
2 2
# second group
>>> dfs[1]
a
3 3
4 2
5 2
# group label of each row
>>> df['group num'] = df.groupby(new_labels).ngroup()
>>> df.head(20)
a group num
0 1 0
1 1 0
2 2 0
3 3 1
4 2 1
5 2 1
6 3 2
7 1 2
8 1 2
9 2 3
10 3 3
11 2 4
12 2 4
13 3 4
14 1 5
15 1 5
16 2 5
17 3 6
18 2 6
19 2 6
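As a quick sanity check (continuing from the example above, with new_labels still in scope), every new group respects the 3-row cap; note that a single old run longer than 3 rows would still stay together as one oversized group under this scheme.
>>> sizes = df.groupby(new_labels).size()
>>> sizes.max()
3
>>> len(sizes)
25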
Performance Comparison
I compared your solution with mine for a DataFrame of 300k rows.
Setup
Here is the setup used.
import pandas as pd
import numpy as np
# my solution
def create_new_labels(df):
    group = (df['a'] != df['a'].shift()).cumsum()
    old_group_sizes = df.groupby(group).size().tolist()
    new_labels = []
    group_label = 0
    group_size = 0
    for s in old_group_sizes:
        group_size += s
        if group_size > 3:
            group_label += 1
            group_size = s
        new_labels.extend([group_label] * s)
    dfs = [g for k, g in df.groupby(new_labels)]
    return dfs

# OP solution
def append_rows(df):
    group = (df['a'] != df['a'].shift()).cumsum()
    dfs = [pd.DataFrame()]
    for i, j in df.groupby(group):
        curRow = j.shape[0]
        prevRow = dfs[-1].shape[0]
        if curRow + prevRow <= 3:
            dfs[-1] = dfs[-1].append(j)
        else:
            dfs.append(j)
    return dfs
n = 300_000
df = pd.DataFrame({'a': np.random.randint(1, 4, size=n)})
>>> df.shape
(300000, 1)
Results
Here are the results.
>>> %timeit append_rows(df)
50.1 s ± 996 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit create_new_labels(df)
2.92 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This means that on my machine, my solution is ~ 17 times faster.