I have a DataFrame which I need to aggregate. The data can be of mixed type. I can easily achieve this for numeric data using a simple groupby.mean().
Example:
import pandas as pd
import numpy as np
d = {'ID':[1,1,1,2,2,2,3,3,3,3],
'ID2':['A','A','A','B','B','B','C','C','C','C'],
'p1':[1,2,1,100,1,1,10,3,2,1],
'p2':[1,2,1,100,1,1,10,3,2,2],
'p3':[1,2,1,100,1,1,10,3,2,5],
'p4':['A','A','A','B','B','B','C','C','X','X']
}
df = pd.DataFrame(d)
ID ID2 p1 p2 p3 p4
0 1 A 1 1 1 A
1 1 A 2 2 2 A
2 1 A 1 1 1 A
3 2 B 100 100 100 B
4 2 B 1 1 1 B
5 2 B 1 1 1 B
6 3 C 10 10 10 C
7 3 C 3 3 3 C
8 3 C 2 2 2 X
9 3 C 1 2 5 X
I then apply the following df2 = df.groupby(['ID','ID2']).mean()
which returns:
p1 p2 p3
ID ID2
1 A 1.333333 1.333333 1.333333
2 B 34.000000 34.000000 34.000000
3 C 4.000000 4.250000 5.000000
This is great however (not unexpected) does not accommodate 'p4' which is of type str. I would like to aggregate the string values where if all values are the same then keep the value else replace with NaN and include within the results set as follows:
p1 p2 p3 p4
ID ID2
1 A 1.333333 1.333333 1.333333 A
2 B 34.000000 34.000000 34.000000 B
3 C 4.000000 4.250000 5.000000 NaN
Additionally to this I potentially have unknown number of columns as such would like to do this by some form of type checking so not to have to explicitly state the param names.
Is there a pythonic way to achieve this without deconstructing the DataFrame into alpha/numerics?
CodePudding user response:
You can use a custom aggregation function:
dct = {
'p1': 'mean',
'p2': 'mean',
'p3': 'mean',
'p4': lambda col: col.mode() if col.nunique() == 1 else np.nan,
}
agg = df.groupby(['ID','ID2']).agg(**{k: (k, v) for k, v in dct.items()})
Or, by type:
dct = {
'number': 'mean',
'object': lambda col: col.mode() if col.nunique() == 1 else np.nan,
}
groupby_cols = ['ID','ID2']
dct = {k: v for i in [{col: agg for col in df.select_dtypes(tp).columns.difference(groupby_cols)} for tp, agg in dct.items()] for k, v in i.items()}
agg = df.groupby(groupby_cols).agg(**{k: (k, v) for k, v in dct.items()})
Output for both:
>>> agg
p1 p2 p3 p4
ID ID2
1 A 1.333333 1.333333 1.333333 A
2 B 34.000000 34.000000 34.000000 B
3 C 4.000000 4.250000 5.000000 NaN