I (think I) need to write a custom aggregation function for the geopandas.GeoDataFrame.dissolve() operation. When merging multiple polygons, I want to keep the attributes of the polygon with the largest area that also fulfils other criteria. The operation works fine, but afterwards all attributes of my GeoDataFrame are of dtype object.
The same issue happens with regular pandas groupby(), so I have simplified the example below. Can someone tell me if I should write my custom_sort() differently, to keep the dtypes intact?
import pandas as pd
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'ints': [1, 2, 3, 4],
    'floats': [1.0, 2.0, 2.2, 3.2],
    'strings': ['foo', 'bar', 'baz', 'qux'],
    'bools': [True, True, True, False],
    'test': ['drop this', 'keep this', 'keep this', 'drop this'],
})
def custom_sort(df):
    """Define custom aggregation function with special sorting."""
    df = df.sort_values(by=['bools', 'floats'], ascending=False)
    return df.iloc[0]
print(df)
print(df.dtypes)
print()
grouped = df.groupby(by='group').agg(custom_sort)
print(grouped)
print(grouped.dtypes) # Issue: All dtypes are object
print()
print(grouped.convert_dtypes().dtypes) # Possible solution, but not for me
# Please note that I cannot use convert_dtypes(). I actually need this for
# geopandas.GeoDataFrame.dissolve() and I think convert_dtypes() messes up
# the geometry information
Output:
  group  ints  floats strings  bools       test
0     A     1     1.0     foo   True  drop this
1     A     2     2.0     bar   True  keep this
2     B     3     2.2     baz   True  keep this
3     B     4     3.2     qux  False  drop this
group       object
ints         int64
floats     float64
strings     object
bools         bool
test        object
dtype: object

       ints floats strings bools       test
group
A         2    2.0     bar  True  keep this
B         3    2.2     baz  True  keep this
ints       object
floats     object
strings    object
bools      object
test       object
dtype: object

ints         Int64
floats     Float64
strings     string
bools      boolean
test        string
dtype: object
CodePudding user response:
The source of the problem is that df.iloc[0] returns a pandas Series. A Series has a single dtype, but this one holds values taken from columns with different dtypes, so pandas falls back to the common dtype object. If I recall correctly, the details depend on the version of pandas you're working with, as this behavior has changed over time.
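To make the upcasting concrete, here is a minimal sketch on a small made-up frame (the column names simply mirror the question). It only illustrates how a single row behaves, not what happens inside groupby itself:

```python
import pandas as pd

df = pd.DataFrame({
    'ints': [1, 2],
    'floats': [1.0, 2.0],
    'strings': ['foo', 'bar'],
    'bools': [True, False],
})

# One row spans columns of different dtypes, so the resulting
# Series has to use a dtype that can hold all of them: object.
row = df.iloc[0]
print(row.dtype)

# With numeric columns only, a common numeric dtype exists,
# so the row is upcast to float64 instead of object.
numeric_row = df[['ints', 'floats']].iloc[0]
print(numeric_row.dtype)
```

Once every group has been reduced to such an object Series, the original column dtypes are no longer available when the rows are stacked back into a frame.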
The solution to your problem depends heavily on what your custom agg function actually does.
In your toy example, I would suggest manipulating your dataframe beforehand and using the simplest possible aggregating function.
For example, doing the sorting up front reduces the aggregation to a plain first:
(df.sort_values(by=['bools', 'floats'],
                ascending=False)
   .groupby(by='group')
   .agg('first'))
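As a quick check, the same idea can be run end to end on the data from the question. This is only a sketch for the plain pandas case; whether it carries over unchanged to geopandas.GeoDataFrame.dissolve() is an assumption I have not verified:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'ints': [1, 2, 3, 4],
    'floats': [1.0, 2.0, 2.2, 3.2],
    'strings': ['foo', 'bar', 'baz', 'qux'],
    'bools': [True, True, True, False],
    'test': ['drop this', 'keep this', 'keep this', 'drop this'],
})

# Sort once so the preferred row of each group comes first,
# then let the built-in 'first' aggregation pick it.
result = (
    df.sort_values(by=['bools', 'floats'], ascending=False)
      .groupby(by='group')
      .agg('first')
)

print(result)
print(result.dtypes)  # compare with df.dtypes
```

Because 'first' works column by column, each output column is built from values of one input column, so the numeric columns keep their int64 and float64 dtypes.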
For what it's worth, I'd also suggest you use a more recent pandas version.
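One caveat on the 'first' aggregation: by default it returns the first non-null value of each column, so with missing data it can combine values from different rows of a group. If the whole row must be kept, one possible alternative (my suggestion, not part of the approach above) is to sort and then keep the first row per group with drop_duplicates. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A'],
    'floats': [2.0, 1.0],
    'strings': [None, 'foo'],
})

ordered = df.sort_values(by='floats', ascending=False)

# Column-wise: 'first' skips the missing string and takes 'foo'
# from the second row, mixing values from two different rows.
per_column = ordered.groupby('group').agg('first')

# Row-wise: keep the first row of each group exactly as it is,
# including its missing value.
whole_row = (
    ordered.drop_duplicates(subset='group', keep='first')
           .set_index('group')
)

print(per_column)
print(whole_row)
```

Both variants avoid returning a mixed-dtype Series from a custom function, which is what caused the object columns in the first place.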