I have a dataframe which I have grouped based on a column. I then have to merge these "grouped dataframes" into newer dataframes, under the condition that a new dataframe must not have more than x rows (3 in this case); if adding a group would exceed that count, I start a new df (the else part in my code). I have code that does this, but it is slow on my actual dataset of 300,000 rows.
Code
import pandas as pd
a = [1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3]
df = pd.DataFrame({"a": a})
dfs = [pd.DataFrame()]
group = (df['a'] != df['a'].shift()).cumsum()
for i, j in df.groupby(group):
    curRow = j.shape[0]
    prevRow = dfs[-1].shape[0]
    # would adding this group push the last df past 3 rows?
    if curRow + prevRow <= 3:
        # still within the limit, so add to the previous df
        dfs[-1] = dfs[-1].append(j)
    else:
        # it would exceed 3, so start this as a new df
        dfs.append(j)
Expected Output
"dfs" will have 25 dataframes but faster than current code
Current Output
"dfs" will have 25 dataframes
For those asking about the logic of the group-by: it basically reproduces itertools.groupby, which groups a sequence into runs of consecutive equal values even when the sequence is not sorted; see "Make Pandas groupby act similarly to itertools groupby".
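For concreteness, a minimal illustration of that equivalence (toy data, not the frame above): itertools.groupby splits a sequence into runs of consecutive equal values, and the shift/cumsum trick produces one integer label per such run.
from itertools import groupby
import pandas as pd

a = [1, 1, 2, 3, 2, 2, 3]

# itertools.groupby forms runs of *consecutive* equal values
print([list(g) for _, g in groupby(a)])
# [[1, 1], [2], [3], [2, 2], [3]]

# the shift/cumsum trick assigns one label per run
s = pd.Series(a)
print((s != s.shift()).cumsum().tolist())
# [1, 1, 2, 3, 4, 4, 5]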
CodePudding user response:
You should never append to DataFrames in a loop; each append copies all of the data accumulated so far, which is very slow. That is most likely the bottleneck.
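To illustrate the cost with a toy sketch (made-up piece sizes, not the OP's data): growing a DataFrame step by step copies every accumulated row on each iteration, so the loop is quadratic, while collecting the pieces in a list and concatenating once is linear. (DataFrame.append itself was deprecated in pandas 1.4 and removed in 2.0.)
import pandas as pd

pieces = [pd.DataFrame({"a": range(3)}) for _ in range(1000)]

# quadratic: each step copies all rows accumulated so far
slow = pd.DataFrame()
for p in pieces:
    slow = pd.concat([slow, p])

# linear: collect the pieces first, concatenate once
fast = pd.concat(pieces)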
One way to avoid that is to iterate over the old group sizes and determine the new group label of each row according to the constraint, then pass those labels to DataFrame.groupby to form the new groups.
This method should be a lot faster for large DataFrames.
import pandas as pd

a = [1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3]
df = pd.DataFrame({"a": a})

group = (df['a'] != df['a'].shift()).cumsum()
old_group_sizes = df.groupby(group).size().tolist()

# new group labels to pass to groupby (one for each row)
new_labels = []
# initialize the new group label (an integer)
group_label = 0
# current group size
group_size = 0

# iterate over the old groups' sizes
for s in old_group_sizes:
    # increment the size of the current group
    group_size += s
    if group_size > 3:        # if it exceeds the threshold
        group_label += 1      # start a new group
        group_size = s        # and reset the new group's size
    # set the new group label for each row of the old group
    new_labels.extend([group_label] * s)

# create a list of the new groups / DataFrames
dfs = [g for k, g in df.groupby(new_labels)]
Output
# number of new groups
>>> len(dfs)
25
# first group
>>> dfs[0]
a
0 1
1 1
2 2
# second group
>>> dfs[1]
a
3 3
4 2
5 2
# group label of each row
>>> df['group num'] = df.groupby(new_labels).ngroup()
>>> df.head(20)
a group num
0 1 0
1 1 0
2 2 0
3 3 1
4 2 1
5 2 1
6 3 2
7 1 2
8 1 2
9 2 3
10 3 3
11 2 4
12 2 4
13 3 4
14 1 5
15 1 5
16 2 5
17 3 6
18 2 6
19 2 6
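As a quick sanity check (continuing from the example above, with new_labels still in scope), every new group respects the 3-row cap; note that a single old run longer than 3 rows would still stay together as one oversized group under this scheme.
>>> sizes = df.groupby(new_labels).size()
>>> sizes.max()
3
>>> len(sizes)
25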
Performance Comparison
I compared your solution with mine for a DataFrame of 300k rows.
Setup
Here is the setup used.
import pandas as pd
import numpy as np
# my solution
def create_new_labels(df):
    group = (df['a'] != df['a'].shift()).cumsum()
    old_group_sizes = df.groupby(group).size().tolist()
    new_labels = []
    group_label = 0
    group_size = 0
    for s in old_group_sizes:
        group_size += s
        if group_size > 3:
            group_label += 1
            group_size = s
        new_labels.extend([group_label] * s)
    dfs = [g for k, g in df.groupby(new_labels)]
    return dfs

# OP solution
def append_rows(df):
    group = (df['a'] != df['a'].shift()).cumsum()
    dfs = [pd.DataFrame()]
    for i, j in df.groupby(group):
        curRow = j.shape[0]
        prevRow = dfs[-1].shape[0]
        if curRow + prevRow <= 3:
            dfs[-1] = dfs[-1].append(j)
        else:
            dfs.append(j)
    return dfs
n = 300_000
df = pd.DataFrame({'a': np.random.randint(1, 4, size=n)})
>>> df.shape
(300000, 1)
Results
Here are the results.
>>> %timeit append_rows(df)
50.1 s ± 996 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit create_new_labels(df)
2.92 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This means that on my machine, my solution is ~ 17 times faster.