combining dataframes that have the same 'country name' and same 'year'-CodePudding

I m trying to merge these dataframes in a way that the final data frame would have matched the country year gdp from first dataframe with its corresponding values from second data frame.

[ dataframe 1: df_GDP ] [ dataframe 2:df_biofuel_prod ]

first data frame :

Country	Country code	year	rgdpe
country1	Code1	year1	rgdpe1
country1	Code1	yearn	rgdpen
country2	Code2	year1	rgdpe1'

second dataframe :

countries	value	year
country1	value1	year1
country1	valuen	yearn
country2	Code2	year1

combined dataframe: | Country | Country code | year |rgdpe |value| |:--------|:------------:|:----:|:-----:|:---:| |country1 | Code1 | year1|rgdpe1 |value| |country1 | Code1 | yearn|rgdpen |Value| |country2 | Code2 | year1|rgdpe1'|Value|

    combined=pd.merge(left=df_biofuel_prod, right=df_GDP[['rgdpe']], left_on='Value', right_on='country', how='right')
    combined.to_csv('../../combined_test.csv')

the results of this code gives me just the rgdpe column while the other column are empty. What would be the most efficient way to merge and match these dataframes ?

CodePudding user response：

First, from the data screen cap, it looks like the "country" column in your first dataset "df_GDP" is set as index. Reset it using "reset_index()". Then merge on multiple columns like left_on=["countries","year"] and right_on=["country","year"]. And since you want to retain all records from your main dataframe "df_biofuel_prod", so it should be "left" join:

combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")

Full example with dummy data:

df_GDP = pd.DataFrame(data=[["USA",2001,400],["USA",2002,450],["CAN",2001,150],["CAN",2002,170]], columns=["country","year","rgdpe"]).set_index("country")

df_biofuel_prod = pd.DataFrame(data=[["USA",400,2001],["USA",450,2003],["CAN",150,2001],["CAN",170,2003]], columns=["countries","Value","year"])

combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")

[Out]:
  countries  Value  year country  rgdpe
0       USA    400  2001     USA  400.0
1       USA    450  2003     NaN    NaN
2       CAN    150  2001     CAN  150.0
3       CAN    170  2003     NaN    NaN

You see "NaN" where matching data is not available in "df_GDP".