Home > database >  Sum value in specific combinations of rows
Sum value in specific combinations of rows

Time:09-07

I have the following dataframe:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name' : ['Jake', 'Nate', '', 'Alex', '', 'Max', 'Nate', 'Jake'],
                    'Color' : ['', 'red;blue', 'blue;pink', 'green;blue;red', '', '', 'blue', 'red;yellow'],
                    'Value_1' : [1211233.419, 4007489.726, 953474.6894, np.NaN, 1761987.704, 222600361, 404419.2243, 606066.067 ],
                    'Value_2' : [np.NaN, 1509907.457, 4792269.911, 43486.59312, np.NaN, np.NaN, 2066645.251, 60988660.37],
                    'Value_3' : [1175299.998, np.NaN, 1888559.459, np.NaN, 444689.0177, 405513.0572, 343704.0269, 2948494.383]})
---
   Name           Color       Value_1       Value_2       Value_3
0  Jake                  1.211233e 06           NaN  1.175300e 06
1  Nate        red;blue  4.007490e 06  1.509907e 06           NaN
2             blue;pink  9.534747e 05  4.792270e 06  1.888559e 06
3  Alex  green;blue;red           NaN  4.348659e 04           NaN
4                        1.761988e 06           NaN  4.446890e 05
5   Max                  2.226004e 08           NaN  4.055131e 05
6  Nate            blue  4.044192e 05  2.066645e 06  3.437040e 05
7  Jake      red;yellow  6.060661e 05  6.098866e 07  2.948494e 06

I need two things:

1)In the first case I need to add all the values (Value_1, Value_2, Value_3) where I have the same name and get for example:

   Name       Value_1       Value_2       Value_3
0  Jake  1.817299e 06  6.098866e 07  4.123794e 06
1  Nate  4.411909e 06  3.576553e 06  3.437040e 05
2  Alex           NaN  4.348659e 04           NaN
3   Max  2.226004e 08           NaN  4.055131e 05

2)I need the same thing but with the values of the name column plus the splits of the color column (only if there is at least one name and one color in the same row):

   Name           Color       Value_1       Value_2       Value_3
0  Alex           green           NaN  4.348659e 04           NaN
1  Alex            blue           NaN  4.348659e 04           NaN
3  Alex             red           NaN  4.348659e 04           NaN
4  Jake             red  6.060661e 05  6.098866e 07  2.948494e 06
5  Jake          yellow  6.060661e 05  6.098866e 07  2.948494e 06
6  Nate             red  4.007490e 06  1.509907e 06           NaN
7  Nate            blue  4.411909e 06  3.576553e 06  3.437040e 05

(Note that in this case the only line present twice is Nate-Blue)

CodePudding user response:

df1['Color'] = df1['Color'].apply(lambda x: x.split(';'))
df1.explode('Color')

CodePudding user response:

You can use:

(df1.assign(Color=df1['Color'].str.split(';'))
    .explode('Color')
    .groupby(['Name', 'Color'], as_index=False)
    .sum()
    .replace('', pd.NA).dropna()
)

output:

    Name   Color       Value_1       Value_2       Value_3
3   Alex    blue  0.000000e 00  4.348659e 04  0.000000e 00
4   Alex   green  0.000000e 00  4.348659e 04  0.000000e 00
5   Alex     red  0.000000e 00  4.348659e 04  0.000000e 00
7   Jake     red  6.060661e 05  6.098866e 07  2.948494e 06
8   Jake  yellow  6.060661e 05  6.098866e 07  2.948494e 06
10  Nate    blue  4.411909e 06  3.576553e 06  3.437040e 05
11  Nate     red  4.007490e 06  1.509907e 06  0.000000e 00

CodePudding user response:

First replace empty strings in first 2 columns to mising values:

df1[['Name','Color']] = df1[['Name','Color']].replace('', np.nan)

Then aggregate sum with min_count=1 for missing values instead 0:

df2 = df1.groupby('Name', as_index=False).sum(min_count=1)
print (df2)
   Name       Value_1       Value_2       Value_3
0  Alex           NaN  4.348659e 04           NaN
1  Jake  1.817299e 06  6.098866e 07  4.123794e 06
2   Max  2.226004e 08           NaN  4.055131e 05
3  Nate  4.411909e 06  3.576553e 06  3.437040e 05

For second ouput first use Series.str.split with DataFrame.explode and then aggregate sum:

df3 = (df1.assign(Color=df1['Color'].str.split(';'))
          .explode('Color')
          .groupby(['Name', 'Color'], as_index=False)
          .sum(min_count=1))
print (df3)
   Name   Color       Value_1       Value_2       Value_3
0  Alex    blue           NaN  4.348659e 04           NaN
1  Alex   green           NaN  4.348659e 04           NaN
2  Alex     red           NaN  4.348659e 04           NaN
3  Jake     red  6.060661e 05  6.098866e 07  2.948494e 06
4  Jake  yellow  6.060661e 05  6.098866e 07  2.948494e 06
5  Nate    blue  4.411909e 06  3.576553e 06  3.437040e 05
6  Nate     red  4.007490e 06  1.509907e 06           NaN
  • Related