I have a dataframe which simplified looks like this
id surface longitude latitude othercolumns
1 20 5.878 46.943 ...
1 15 5.88 46.7 ...
1 25 5.5 47 ...
2 30 9.6 40.2 ...
3 10 4.7 36 ...
3 25 3.8 41 ...
I read several posts advising groupby but in this case, to the best of my knowledge, it doesn't see to be adapted as it seems that only the variables grouped would be kept, and I need the other columns as well.
On advise I looked as well here, but the formula :
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
for k,v in df.dtypes[1:].items()})
Seems (if I understand it correctly) to apply the calculation to all numeric fields, which would not make sense to sum geographic coordinates.
As a consequence I thought maybe by aiming at having:
(A)
id surface longitude latitude othercolumns
1 60 5.878 46.943 ...
1 60 5.88 46.7 ...
1 60 5.5 47 ...
2 30 9.6 40.2 ...
3 35 4.7 36 ...
3 35 3.8 41 ...
(ex: 60 for the sum of 20 15 25 in the case of id 1)
Then applying df.drop_duplicates() so I can retain something like (B)
id surface longitude latitude othercolumns
1 60 5.878 46.943 ...
2 30 9.6 40.2 ...
3 35 4.7 36 ...
However I don't know how to manage to get to (A), I hope I'm explaining this issue with clarity as I don't have the impression that I saw this brought out before.
CodePudding user response:
You get to your (A) datframe example by using 'transform' which transforms the grouped by operation back to the original dataframe.
Something like
df['surface']=df.groupby('id')['surface'].transform('sum')
CodePudding user response:
Make a list of
numeric_cols
with list comprehensionconditionally
.agg
the data with dictionary comprehension according to the dtype.numeric_cols = [col for col in df.columns[1:] if df[col].dtype == 'int' or 'float' ] (df.groupby('id', as_index=False) .agg({f'{col}' : 'sum' if df[col].dtype == 'int' else 'first' for col in numeric_cols})) Out[1]: id surface longitude latitude 0 1 60 5.878 46.943 1 2 30 9.600 40.200 2 3 35 4.700 36.000
You could also do this by column name:
numeric_cols = [col for col in df.columns[1:] if df[col].dtype == 'int' or 'float' ]
(df.groupby('id', as_index=False)
.agg({f'{col}' : 'first' if col == 'latitude' or 'longitude'
else 'sum'
for col in numeric_cols}))