How to map multiple datasets based on the same row values of one column?

Time:12-14

I have three datasets,

df1
   name                    A         B         C         D
0  Jeffrey Ray      0.171130  0.307791 -0.031378  0.366607
1  Crystal Roberts  0.429614  1.674201  0.176098 -1.289623
2  Paul Choute      0.396332  0.879003  1.889926  0.745100
3  Amy Clements     1.452253  0.724694 -0.830645 -0.342138
4  Stanley Coston  -0.135439 -0.293293  0.684627  0.538049

df2
   name                    A         B         C         D
0  Jeffrey Ray      0.173430  0.399791 -0.012278   0.88897
1  Sara Flicker     0.635744  1.699001  0.176098  -1.11113
2  Jack Sether      0.323333  0.879003  1.889926  0.983640
3  Amy Clements     1.452253  0.784734 -0.673695 -0.342138
4  Stanley Coston  -0.143222 -0.293293  0.683647  0.638479

df3
   name                    A         B         C         D
0  Jeffrey Ray      0.638273  0.687777  0.766666  0.900032
1  Sara Flicker     0.635744  1.699001  0.176098  -1.11113
2  Samantha Runyon  0.323333 -0.879003  1.889926 -0.683640
3  Amy Clements    -1.452253  0.784734 -0.673695 -0.342138
4  Stanley Coston  -0.143222 -0.282222  0.683647 -0.638479

I want to match rows by name across all three datasets and, for each name that appears in every dataset, get the sum of D:

Jeffrey Ray     0.366607 + 0.88897 + 0.900032
Amy Clements    -0.342138 + (-0.342138) + (-0.342138)
Stanley Coston  0.538049 + 0.638479 + (-0.638479)

I tried pd.merge:

pd.merge(df1, df2, on='name', how='inner').sum().reset_index()

But the output is not what I expected. Any suggestions on how I should change it? Another issue is that pd.merge can only merge two datasets at a time. Is there an efficient way to map multiple datasets at once without a for loop?

Thanks

CodePudding user response:

How about:

# Stack the three frames; a name that occurs 3 times is present in all of them
dfs = pd.concat([df1, df2, df3])
out = dfs.groupby('name')['D'].sum()[dfs['name'].value_counts() == 3]

Output:

name
Amy Clements     -1.026414
Jeffrey Ray       2.155609
Stanley Coston    0.538049
Name: D, dtype: float64
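
A self-contained sketch of the same idea, for anyone who wants to run it. Only the name and D columns from the question are reproduced, and the names frames, stacked, stats, total and n are made up for this illustration. It assumes each name occurs at most once per dataset; if a name can repeat inside one dataset, the row count no longer tells you how many datasets it came from.

```python
import pandas as pd

# Only the 'name' and 'D' columns from the question are reproduced here.
df1 = pd.DataFrame({
    "name": ["Jeffrey Ray", "Crystal Roberts", "Paul Choute",
             "Amy Clements", "Stanley Coston"],
    "D": [0.366607, -1.289623, 0.745100, -0.342138, 0.538049],
})
df2 = pd.DataFrame({
    "name": ["Jeffrey Ray", "Sara Flicker", "Jack Sether",
             "Amy Clements", "Stanley Coston"],
    "D": [0.88897, -1.11113, 0.983640, -0.342138, 0.638479],
})
df3 = pd.DataFrame({
    "name": ["Jeffrey Ray", "Sara Flicker", "Samantha Runyon",
             "Amy Clements", "Stanley Coston"],
    "D": [0.900032, -1.11113, -0.683640, -0.342138, -0.638479],
})

frames = [df1, df2, df3]
stacked = pd.concat(frames, ignore_index=True)

# Per name: the sum of D and the number of non-missing D values behind it.
# Assumption: each name occurs at most once per frame, so a count equal to
# len(frames) means the name is present in every frame.
stats = stacked.groupby("name")["D"].agg(total="sum", n="count")
out = stats.loc[stats["n"] == len(frames), "total"]

print(out.round(6))
```

Comparing against len(frames) instead of a hard-coded 3 keeps the filter correct if more datasets are added to the list.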

CodePudding user response:

You can concat only the D column, using the names as index. Use dropna to drop the names that are missing from any dataset, then sum across the columns:

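dfs = [df1, df2, df3]
(pd.concat([d.set_index('name')['D'] for d in dfs], axis=1)  # one D column per dataset
   .dropna()      # keep only names present in every dataset
   .sum(axis=1)   # add the D values across datasets
)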

output:

name
Jeffrey Ray       2.155609
Amy Clements     -1.026414
Stanley Coston    0.538049
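
If you would rather stay with pd.merge, as in the question, here is a minimal sketch that folds a list of frames with functools.reduce. The original attempt probably looked wrong because .sum() on the merged frame adds each column down the rows (name included) instead of adding D across datasets. The labels D1, D2, D3 and D_sum are made up for this example, and only the name and D columns are reproduced. reduce still merges pairwise under the hood, so this is a tidier way to write the loop rather than a way around it.

```python
from functools import reduce

import pandas as pd

# Only the 'name' and 'D' columns from the question are reproduced here.
names1 = ["Jeffrey Ray", "Crystal Roberts", "Paul Choute",
          "Amy Clements", "Stanley Coston"]
names2 = ["Jeffrey Ray", "Sara Flicker", "Jack Sether",
          "Amy Clements", "Stanley Coston"]
names3 = ["Jeffrey Ray", "Sara Flicker", "Samantha Runyon",
          "Amy Clements", "Stanley Coston"]
df1 = pd.DataFrame({"name": names1,
                    "D": [0.366607, -1.289623, 0.745100, -0.342138, 0.538049]})
df2 = pd.DataFrame({"name": names2,
                    "D": [0.88897, -1.11113, 0.983640, -0.342138, 0.638479]})
df3 = pd.DataFrame({"name": names3,
                    "D": [0.900032, -1.11113, -0.683640, -0.342138, -0.638479]})

frames = [df1, df2, df3]

# Keep only name and D, and give each D column a distinct (made-up) label
# D1, D2, ... so repeated merges never collide on column names.
parts = [
    f[["name", "D"]].rename(columns={"D": f"D{i}"})
    for i, f in enumerate(frames, start=1)
]

# Fold the list with successive inner merges on 'name'; only names present
# in every frame survive.
merged = reduce(
    lambda left, right: left.merge(right, on="name", how="inner"),
    parts,
)

# Sum across the D columns only; 'name' stays out of the arithmetic.
merged["D_sum"] = merged.drop(columns="name").sum(axis=1)

print(merged[["name", "D_sum"]].round(6))
```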