I have the following dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Name' : ['Jake', 'Nate', '', 'Alex', '', 'Max', 'Nate', 'Jake'],
'Color' : ['', 'red;blue', 'blue;pink', 'green;blue;red', '', '', 'blue', 'red;yellow'],
'Value_1' : [1211233.419, 4007489.726, 953474.6894, np.NaN, 1761987.704, 222600361, 404419.2243, 606066.067 ],
'Value_2' : [np.NaN, 1509907.457, 4792269.911, 43486.59312, np.NaN, np.NaN, 2066645.251, 60988660.37],
'Value_3' : [1175299.998, np.NaN, 1888559.459, np.NaN, 444689.0177, 405513.0572, 343704.0269, 2948494.383]})
---
Name Color Value_1 Value_2 Value_3
0 Jake 1.211233e 06 NaN 1.175300e 06
1 Nate red;blue 4.007490e 06 1.509907e 06 NaN
2 blue;pink 9.534747e 05 4.792270e 06 1.888559e 06
3 Alex green;blue;red NaN 4.348659e 04 NaN
4 1.761988e 06 NaN 4.446890e 05
5 Max 2.226004e 08 NaN 4.055131e 05
6 Nate blue 4.044192e 05 2.066645e 06 3.437040e 05
7 Jake red;yellow 6.060661e 05 6.098866e 07 2.948494e 06
I need two things:
1)In the first case I need to add all the values (Value_1, Value_2, Value_3) where I have the same name and get for example:
Name Value_1 Value_2 Value_3
0 Jake 1.817299e 06 6.098866e 07 4.123794e 06
1 Nate 4.411909e 06 3.576553e 06 3.437040e 05
2 Alex NaN 4.348659e 04 NaN
3 Max 2.226004e 08 NaN 4.055131e 05
2)I need the same thing but with the values of the name column plus the splits of the color column (only if there is at least one name and one color in the same row):
Name Color Value_1 Value_2 Value_3
0 Alex green NaN 4.348659e 04 NaN
1 Alex blue NaN 4.348659e 04 NaN
3 Alex red NaN 4.348659e 04 NaN
4 Jake red 6.060661e 05 6.098866e 07 2.948494e 06
5 Jake yellow 6.060661e 05 6.098866e 07 2.948494e 06
6 Nate red 4.007490e 06 1.509907e 06 NaN
7 Nate blue 4.411909e 06 3.576553e 06 3.437040e 05
(Note that in this case the only line present twice is Nate-Blue)
CodePudding user response:
df1['Color'] = df1['Color'].apply(lambda x: x.split(';'))
df1.explode('Color')
CodePudding user response:
You can use:
(df1.assign(Color=df1['Color'].str.split(';'))
.explode('Color')
.groupby(['Name', 'Color'], as_index=False)
.sum()
.replace('', pd.NA).dropna()
)
output:
Name Color Value_1 Value_2 Value_3
3 Alex blue 0.000000e 00 4.348659e 04 0.000000e 00
4 Alex green 0.000000e 00 4.348659e 04 0.000000e 00
5 Alex red 0.000000e 00 4.348659e 04 0.000000e 00
7 Jake red 6.060661e 05 6.098866e 07 2.948494e 06
8 Jake yellow 6.060661e 05 6.098866e 07 2.948494e 06
10 Nate blue 4.411909e 06 3.576553e 06 3.437040e 05
11 Nate red 4.007490e 06 1.509907e 06 0.000000e 00
CodePudding user response:
First replace empty strings in first 2 columns to mising values:
df1[['Name','Color']] = df1[['Name','Color']].replace('', np.nan)
Then aggregate sum
with min_count=1
for missing values instead 0
:
df2 = df1.groupby('Name', as_index=False).sum(min_count=1)
print (df2)
Name Value_1 Value_2 Value_3
0 Alex NaN 4.348659e 04 NaN
1 Jake 1.817299e 06 6.098866e 07 4.123794e 06
2 Max 2.226004e 08 NaN 4.055131e 05
3 Nate 4.411909e 06 3.576553e 06 3.437040e 05
For second ouput first use Series.str.split
with DataFrame.explode
and then aggregate sum
:
df3 = (df1.assign(Color=df1['Color'].str.split(';'))
.explode('Color')
.groupby(['Name', 'Color'], as_index=False)
.sum(min_count=1))
print (df3)
Name Color Value_1 Value_2 Value_3
0 Alex blue NaN 4.348659e 04 NaN
1 Alex green NaN 4.348659e 04 NaN
2 Alex red NaN 4.348659e 04 NaN
3 Jake red 6.060661e 05 6.098866e 07 2.948494e 06
4 Jake yellow 6.060661e 05 6.098866e 07 2.948494e 06
5 Nate blue 4.411909e 06 3.576553e 06 3.437040e 05
6 Nate red 4.007490e 06 1.509907e 06 NaN