In Pandas 1.1.4, I am receiving a DataError: No numeric types to aggregate when using an ExpandingGroupby.
Example dataset:
tmp = pd.DataFrame({'col1':['a','b','b','c','d','d'], 'col2': ['red','red','green','green','red','blue']})
print(tmp)
col1 col2
0 a red
1 b red
2 b green
3 c green
4 d red
5 d blue
This works:
tmp.groupby('col1').agg(lambda x: ','.join(x))
And this works:
tmp.groupby('col1').expanding().agg('count')
But this returns an error:
tmp.groupby('col1').expanding().agg(lambda x: ','.join(x))
DataError: No numeric types to aggregate
There is no conceptual reason this shouldn't work, and there are several references online to people using custom functions within an ExpandingGroupby.
There is no obvious reason the column should need to be numeric, especially given that count works on the same non-numeric column. What is happening here? If this can't be done natively for whatever reason, how can I do it manually?
CodePudding user response:
You can use accumulate from the itertools module:
from itertools import accumulate

concat = lambda *args: ','.join(args)                # join the running result with the next value
expand = lambda x: list(accumulate(x, func=concat))  # expanding concatenation within each group
tmp['col3'] = tmp.groupby('col1')['col2'].transform(expand)
print(tmp)
# Output
col1 col2 col3
0 a red red
1 b red red
2 b green red,green
3 c green green
4 d red red
5 d blue red,blue
Update
One-line version:
tmp['col3'] = tmp.groupby('col1')['col2'].transform(lambda x: list(accumulate(x, func=lambda *args: ','.join(args))))
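For context on why this works: accumulate yields the running result of applying the binary function to each successive element, which is exactly the expanding behaviour needed here. A minimal sketch on a plain list (the names are just for illustration):

from itertools import accumulate

# accumulate yields one running result per element:
# 'red', then join('red', 'green'), then join('red,green', 'blue')
values = ['red', 'green', 'blue']
print(list(accumulate(values, lambda acc, nxt: ','.join([acc, nxt]))))
# ['red', 'red,green', 'red,green,blue']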
CodePudding user response:
If you want to join the previous rows' values to the next one within each group, you could use cumsum and add the strings as you go:
tmp['expanding_join'] = tmp.groupby('col1')['col2'].apply(lambda x: (x + ',').cumsum()).str.rstrip(',')
Output:
col1 col2 expanding_join
0 a red red
1 b red red
2 b green red,green
3 c green green
4 d red red
5 d blue red,blue
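Why this works: on an object (string) dtype Series, + concatenates and cumsum accumulates with +, so the cumulative "sum" builds the running join, and the trailing comma is stripped at the end. A small standalone sketch of that behaviour:

import pandas as pd

# For string (object) dtype, cumsum concatenates rather than adds, so
# appending ',' to every value and taking the cumulative sum produces the
# expanding join; rstrip then removes the trailing comma.
s = pd.Series(['red', 'green', 'blue'])
print((s + ',').cumsum().str.rstrip(','))
# 0 -> 'red', 1 -> 'red,green', 2 -> 'red,green,blue'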
CodePudding user response:
A third option I have found:
tmp['col3'] = tmp.groupby('col1')['col2'].transform(lambda x: [';'.join(x[:i+1]) for i in range(len(x))])
Putting this out there in case it is useful to anyone; however, both the options by @enke and @Corralien are superior.
Tested on a large dataset, the times were:
accumulate: 0:13
apply: 0:25
for loop: 2:28
Since the accumulate option was faster and more intuitive, I am marking that as the accepted answer, although the cumsum approach is very nice for being a one-liner with no imports.
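For reference, a rough sketch of how such a timing comparison could be reproduced; the dataset shape, group count, and sizes below are illustrative assumptions, not the original benchmark data:

import time
from itertools import accumulate

import numpy as np
import pandas as pd

# Illustrative synthetic data: many small groups of short strings
# (assumed sizes; not the original test set).
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({
    'col1': rng.integers(0, 200_000, size=n),
    'col2': rng.choice(['red', 'green', 'blue'], size=n),
})

candidates = {
    'accumulate': lambda: big.groupby('col1')['col2'].transform(
        lambda x: list(accumulate(x, lambda *args: ','.join(args)))),
    'apply/cumsum': lambda: big.groupby('col1')['col2'].apply(
        lambda x: (x + ',').cumsum()).str.rstrip(','),
    'for loop': lambda: big.groupby('col1')['col2'].transform(
        lambda x: [','.join(x[:i + 1]) for i in range(len(x))]),
}

for name, run in candidates.items():
    start = time.perf_counter()
    run()
    print(f'{name}: {time.perf_counter() - start:.1f}s')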