Writing custom pandas aggfunc without making all dtypes object

Time:06-04

I (think I) need to write a custom aggregation function for the geopandas.GeoDataFrame.dissolve() operation. When merging multiple polygons, I want to keep the information of the polygon that has the largest area and also fulfils other criteria. The operation works fine, but afterwards all attributes of my GeoDataFrame are of dtype object.

The same issue happens with regular pandas groupby(), so I have simplified the example below. Can someone tell me whether I should write my custom_sort() differently to keep the dtypes intact?

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'ints': [1, 2, 3, 4],
    'floats': [1.0, 2.0, 2.2, 3.2],
    'strings': ['foo', 'bar', 'baz', 'qux'],
    'bools': [True, True, True, False],
    'test': ['drop this', 'keep this', 'keep this', 'drop this'],
    })


def custom_sort(df):
    """Define custom aggregation function with special sorting."""
    df = df.sort_values(by=['bools', 'floats'], ascending=False)
    return df.iloc[0]


print(df)
print(df.dtypes)
print()
grouped = df.groupby(by='group').agg(custom_sort)
print(grouped)
print(grouped.dtypes)  # Issue: All dtypes are object
print()
print(grouped.convert_dtypes().dtypes)  # Possible solution, but not for me

# Please note that I cannot use convert_dtypes(). I actually need this for
# geopandas.GeoDataFrame.dissolve() and I think convert_dtypes() messes up
# the geometry information

Output:

  group  ints  floats strings  bools       test
0     A     1     1.0     foo   True  drop this
1     A     2     2.0     bar   True  keep this
2     B     3     2.2     baz   True  keep this
3     B     4     3.2     qux  False  drop this
group       object
ints         int64
floats     float64
strings     object
bools         bool
test        object
dtype: object

      ints floats strings bools       test
group                                     
A        2    2.0     bar  True  keep this
B        3    2.2     baz  True  keep this
ints       object
floats     object
strings    object
bools      object
test       object
dtype: object

ints         Int64
floats     Float64
strings     string
bools      boolean
test        string
dtype: object

CodePudding user response:

The source of the problem is that df.iloc[0] returns a pandas Series. A Series has a single dtype, so when the selected row holds values of different types (int, float, str, bool), pandas falls back to object. If I recall correctly, the details depend on the version of pandas you're working with; this behavior has changed over time.
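A minimal sketch of that effect, using a small hypothetical frame rather than the data from the question: selecting one row across columns of different dtypes gives a Series of dtype object, while a single column keeps its own dtype.

```python
import pandas as pd

# Hypothetical frame with one column per dtype, for illustration only.
mixed = pd.DataFrame({
    'ints': [1, 2],
    'floats': [1.0, 2.0],
    'strings': ['foo', 'bar'],
})

row = mixed.iloc[0]     # one row spanning int, float and str columns
column = mixed['ints']  # one column, a single dtype

print(row.dtype)     # object
print(column.dtype)  # int64
```

When an aggregation function returns such a row for every group, the object dtype is carried into the assembled result.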

The solution to your problem heavily depends on the operations you're doing in your custom agg function.

In your toy example, I would suggest manipulating your dataframe beforehand and using the simplest possible aggregating function.

For example, doing the sorting up front lets you use a plain 'first' as the aggregation:

(df.sort_values(by=['bools', 'floats'],
                ascending=False)
   .groupby(by='group')
   .agg('first'))
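To check that the dtypes survive, the chain above can be run on the frame from the question and the resulting dtypes inspected. This is only a sketch; the variable name `result` is mine.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'ints': [1, 2, 3, 4],
    'floats': [1.0, 2.0, 2.2, 3.2],
    'strings': ['foo', 'bar', 'baz', 'qux'],
    'bools': [True, True, True, False],
    'test': ['drop this', 'keep this', 'keep this', 'drop this'],
    })

# Sort so that the preferred row comes first within each group,
# then let the built-in 'first' aggregation pick it column by column.
result = (df.sort_values(by=['bools', 'floats'], ascending=False)
            .groupby(by='group')
            .agg('first'))

print(result)
print(result.dtypes)  # the numeric columns should keep int64 / float64
```

Since geopandas.GeoDataFrame.dissolve() takes an aggfunc argument whose default is 'first', sorting the GeoDataFrame before calling dissolve(by=...) may give the same effect there. I have not tested that on real geometry data, so treat it as an assumption.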

For what it's worth, I'd also suggest you use a more recent pandas version.
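One caveat with the sort-then-'first' approach: the 'first' aggregation picks the first non-null value per column, so if the preferred row has missing values, they can be filled in from another row of the same group. If whole rows must be kept, a possible alternative (my suggestion, not something the question or the answer above used) is to sort and then drop duplicates on the group key. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical data: in group A, the row with the largest 'floats'
# has a missing value in 'strings'.
df = pd.DataFrame({
    'group': ['A', 'A', 'B'],
    'floats': [1.0, 2.0, 3.0],
    'strings': ['foo', None, 'baz'],
})

ordered = df.sort_values(by='floats', ascending=False)

# 'first' works column by column and skips missing values,
# so group A gets 'foo' from a different row.
per_column = ordered.groupby(by='group').agg('first')

# drop_duplicates keeps the first complete row per group,
# missing values included.
whole_rows = (ordered.drop_duplicates(subset='group')
                     .set_index('group')
                     .sort_index())

print(per_column)
print(whole_rows)
```

Both variants avoid building a mixed-type row inside a custom function, so the column dtypes are not forced to object.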
