Home > Blockchain >  Pandas: How to sum the values of a column according to a criteria without grouping or adding the oth
Pandas: How to sum the values of a column according to a criteria without grouping or adding the oth

Time:05-12

I have a dataframe which simplified looks like this

id  surface longitude latitude othercolumns
1   20      5.878     46.943   ...
1   15      5.88      46.7     ...
1   25      5.5       47       ...
2   30      9.6       40.2     ...
3   10      4.7       36       ...
3   25      3.8       41       ...

I read several posts advising groupby but in this case, to the best of my knowledge, it doesn't see to be adapted as it seems that only the variables grouped would be kept, and I need the other columns as well.

On advise I looked as well here, but the formula :

df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
                      for k,v in df.dtypes[1:].items()})

Seems (if I understand it correctly) to apply the calculation to all numeric fields, which would not make sense to sum geographic coordinates.

As a consequence I thought maybe by aiming at having:

(A)

id  surface longitude latitude othercolumns
1   60      5.878     46.943   ...
1   60      5.88      46.7     ...
1   60      5.5       47       ...
2   30      9.6       40.2     ...
3   35      4.7       36       ...
3   35      3.8       41       ...

(ex: 60 for the sum of 20 15 25 in the case of id 1)

Then applying df.drop_duplicates() so I can retain something like (B)

id  surface longitude latitude othercolumns
1   60      5.878     46.943   ...
2   30      9.6       40.2     ...
3   35      4.7       36       ...

However I don't know how to manage to get to (A), I hope I'm explaining this issue with clarity as I don't have the impression that I saw this brought out before.

CodePudding user response:

You get to your (A) datframe example by using 'transform' which transforms the grouped by operation back to the original dataframe.

Something like

df['surface']=df.groupby('id')['surface'].transform('sum')

CodePudding user response:

  1. Make a list of numeric_cols with list comprehension

  2. conditionally .agg the data with dictionary comprehension according to the dtype.

    numeric_cols = [col for col in df.columns[1:] if df[col].dtype == 'int' or 'float' ]
    (df.groupby('id', as_index=False)
     .agg({f'{col}' : 'sum' 
           if df[col].dtype == 'int' 
           else 'first' 
           for col in numeric_cols}))
    Out[1]: 
       id  surface  longitude  latitude
    0   1       60      5.878    46.943
    1   2       30      9.600    40.200
    2   3       35      4.700    36.000
    

You could also do this by column name:

numeric_cols = [col for col in df.columns[1:] if df[col].dtype == 'int' or 'float' ]
(df.groupby('id', as_index=False)
 .agg({f'{col}' : 'first' if col == 'latitude' or 'longitude'
   else 'sum' 
   for col in numeric_cols}))
  • Related