Subtract dataframes with completely different row names and column names-CodePudding

My dataframe 1 looks like this:

windcodes	name	yield	perp
163197.SH	shangguo comp	2.9248	NO
154563.SH	guosheng comp	2.886	Yes
789645.IB	guoyou comp	3.418	NO

My dataframe 2 looks like this

windcodes	CALC
1202203.IB	2.5517
1202203.IB	2.48457
1202203.IB	2.62296

and I want my result dataframe 3 to have one more new column than dataframe 1 which is to use the value in column 'yield' in dataframe 1 subtract the value in column 'CALC' in dataframe 2: The result dataframe 3 should be looking like this

windcodes	name	yield	perp	yield-CALC
163197.SH	shangguo comp	2.9248	NO	0.3731
154563.SH	guosheng comp	2.886	Yes	0.40413
789645.IB	guoyou comp	3.418	NO	0.79504

It would be really helpful if anyone can tell me how to do it in python.

CodePudding user response：

You can try something like this:

df1['yield-CALC'] = df1['yield'] - df2['yield']

I'm assuming you don't want to join the dataframes, since the windcodes are not the same.

CodePudding user response：

Do we need to join 2 dataframes from windcodes column? The windcodes are all the same in the sample data you have given in Dataframe2. Can you explain this?

If we are going to join from the windscode field. The code below will work.

df = pd.merge(left=df1, right=df2,how='inner',on='windcodes')
df['yield-CALC'] = df['yield']-df['CALC']

CodePudding user response：

I will try to keep it as elaborated as possible:

environment I have used for coding is Jupyter Notebook

importing our required pandas library

import pandas as pd

getting your first table data in form of lists of lists (you can also use csv,excel etc here)

data_1 = [["163197.SH","shangguo comp",2.9248,"NO"],\
["154563.SH","guosheng comp",2.886,"Yes"] , ["789645.IB","guoyou comp",3.418,"NO"]]

creating dataframe one :

df_1 = pd.DataFrame(data_1 , columns = ["windcodes","name","yield","perp"])
df_1

Output:

getting your second table data in form of lists of lists (you can also use csv,excel etc here)

data_2 = [["1202203.IB",2.5517],["1202203.IB",2.48457],["1202203.IB",2.62296]]

creating dataframe two :

df_2 = pd.DataFrame(data_2 , columns = ["windcodes","CALC"])
df_2

Output:

Now creating the third dataframe:

df_3 = df_1 # becasue first 4 columns are same as our first dataframe 
df_3

Output:

Now calculating the fourth column i.e "yield-CALC" :

df_3["yield-CALC"] = df_1["yield"] - df_2["CALC"] # each df_1 datapoint will be subtracted from df_2 datapoint one by one (still confused? search for "SIMD")

df_3

Output:

Hope you understood.

Happy coding!

CodePudding user response：

Just in case you have completely different indexes, use df2's underlying numpy array:

df1['yield-CALC'] = df1['yield'] - df2['yield'].values