Home > front end >  How to do string operations when aggregating a pandas dataframe?
How to do string operations when aggregating a pandas dataframe?

Time:03-03

I need to perform some aggregations on a pandas dataframe. I'm using pandas version 1.3.3.

It seems I am only able to use builtin python functions, such as the max function, to aggregate columns that contain strings. Trying to do the same thing using any custom function (even one that only calls the builtin max) causes an error, as shown in the example below.

Can anyone tell me what I'm doing wrong in this example, and what is the correct way to use a custom function for string aggregation?

import pandas as pd

# Define a dataframe with two columns - one with strings (a-e), one with numbers (1-5)
foo = pd.DataFrame(
    data={
        'string_col': ['a', 'b', 'c', 'd', 'e'],
        'num_col': [1,2,3,4,5]
    }
)


# Custom aggregation function to concatenate strings
def custom_aggregation_funcion(vals):
    return ", ".join(vals)


# This works - gives a pandas Series with string_col = e, and num_col = 5
a = foo.agg(func={'string_col': max, 'num_col': max})

# This crashes with 'ValueError: cannot perform both aggregation and transformation operations simultaneously'
b = foo.agg(func={'string_col': lambda x: max(x), 'num_col': max})

# Crashes with same error
c = foo.agg(func={'string_col': custom_aggregation_funcion, 'num_col': max})

CodePudding user response:

If you try to run:

foo['string_col'].agg(','.join)

you will see that you get back a Series:

0    a
1    b
2    c
3    d
4    e
Name: string_col, dtype: object

Indeed, your custom function is applied per element, not on the whole Series. Thus the "cannot perform both aggregation and transformation operations simultaneously".

You can change your function to:

# Custom aggregation function to concatenate strings
def custom_aggregation_funcion(vals):
    return ", ".join(vals.to_list())

c = foo.agg(func={'string_col': custom_aggregation_funcion, 'num_col': max})

output:

string_col    a, b, c, d, e
num_col                   5
dtype: object
  • Related