In Pandas 1.1.4, I am receiving a DataError: No numeric types to aggregate when using an ExpandingGroupby.
Example dataset:
tmp = pd.DataFrame({'col1':['a','b','b','c','d','d'], 'col2': ['red','red','green','green','red','blue']})
print(tmp)
col1 col2
0 a red
1 b red
2 b green
3 c green
4 d red
5 d blue
This works:
tmp.groupby('col1').agg(lambda x: ','.join(x))
And this works:
tmp.groupby('col1').expanding().agg('count')
But this returns an error:
tmp.groupby('col1').expanding().agg(lambda x: ','.join(x))
DataError: No numeric types to aggregate
There is no conceptual reason this shouldn't work, and there are several references online to people using custom functions within an ExpandingGroupby.
There is no obvious reason the column should need to be numeric, especially given that count works on the same non-numeric column. What is happening here? If this can't be done natively for whatever reason, how can I do it manually?
CodePudding user response:
You can use accumulate from the itertools module:
from itertools import accumulate

concat = lambda *args: ','.join(args)                # join the running result with the next value
expand = lambda x: list(accumulate(x, func=concat))  # expanding concatenation within each group
tmp['col3'] = tmp.groupby('col1')['col2'].transform(expand)
print(tmp)
# Output
col1 col2 col3
0 a red red
1 b red red
2 b green red,green
3 c green green
4 d red red
5 d blue red,blue
Update
One-line version:
tmp['col3'] = tmp.groupby('col1')['col2'].transform(lambda x: list(accumulate(x, func=lambda *args: ','.join(args))))
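For context on why this works: accumulate yields the running result of applying the binary function to each successive element, which is exactly the expanding behaviour needed here. A minimal sketch on a plain list (the names are just for illustration):

from itertools import accumulate

# accumulate yields one running result per element:
# 'red', then join('red', 'green'), then join('red,green', 'blue')
values = ['red', 'green', 'blue']
print(list(accumulate(values, lambda acc, nxt: ','.join([acc, nxt]))))
# ['red', 'red,green', 'red,green,blue']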
CodePudding user response:
If you want to join the previous rows' values to the next one within each group, you could use cumsum and add the strings as you go:
tmp['expanding_join'] = tmp.groupby('col1')['col2'].apply(lambda x: (x + ',').cumsum()).str.rstrip(',')
Output:
col1 col2 expanding_join
0 a red red
1 b red red
2 b green red,green
3 c green green
4 d red red
5 d blue red,blue
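Why this works: on an object (string) dtype Series, + concatenates and cumsum accumulates with +, so the cumulative "sum" builds the running join, and the trailing comma is stripped at the end. A small standalone sketch of that behaviour:

import pandas as pd

# For string (object) dtype, cumsum concatenates rather than adds, so
# appending ',' to every value and taking the cumulative sum produces the
# expanding join; rstrip then removes the trailing comma.
s = pd.Series(['red', 'green', 'blue'])
print((s + ',').cumsum().str.rstrip(','))
# 0 -> 'red', 1 -> 'red,green', 2 -> 'red,green,blue'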
CodePudding user response:
A third option I have found:
tmp['col3'] = tmp.groupby('col1')['col2'].transform(lambda x: [';'.join(x[:i+1]) for i in range(len(x))])
Putting this out there in case it is useful to anyone; however, both the options by @enke and @Corralien are superior.
Tested on a large dataset, the times were:
accumulate: 0:13
apply: 0:25
for loop: 2:28
Since the accumulate option was faster and more intuitive, I am marking that as the accepted answer, although the cumsum approach is very nice for being a one-liner with no imports.
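For reference, a rough sketch of how such a timing comparison could be reproduced; the dataset shape, group count, and sizes below are illustrative assumptions, not the original benchmark data:

import time
from itertools import accumulate

import numpy as np
import pandas as pd

# Illustrative synthetic data: many small groups of short strings
# (assumed sizes; not the original test set).
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({
    'col1': rng.integers(0, 200_000, size=n),
    'col2': rng.choice(['red', 'green', 'blue'], size=n),
})

candidates = {
    'accumulate': lambda: big.groupby('col1')['col2'].transform(
        lambda x: list(accumulate(x, lambda *args: ','.join(args)))),
    'apply/cumsum': lambda: big.groupby('col1')['col2'].apply(
        lambda x: (x + ',').cumsum()).str.rstrip(','),
    'for loop': lambda: big.groupby('col1')['col2'].transform(
        lambda x: [','.join(x[:i + 1]) for i in range(len(x))]),
}

for name, run in candidates.items():
    start = time.perf_counter()
    run()
    print(f'{name}: {time.perf_counter() - start:.1f}s')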