I m trying to merge these dataframes in a way that the final data frame would have matched the country year gdp from first dataframe with its corresponding values from second data frame.
[] []
first data frame :
Country | Country code | year | rgdpe |
---|---|---|---|
country1 | Code1 | year1 | rgdpe1 |
country1 | Code1 | yearn | rgdpen |
country2 | Code2 | year1 | rgdpe1' |
second dataframe :
countries | value | year |
---|---|---|
country1 | value1 | year1 |
country1 | valuen | yearn |
country2 | Code2 | year1 |
combined dataframe: | Country | Country code | year |rgdpe |value| |:--------|:------------:|:----:|:-----:|:---:| |country1 | Code1 | year1|rgdpe1 |value| |country1 | Code1 | yearn|rgdpen |Value| |country2 | Code2 | year1|rgdpe1'|Value|
combined=pd.merge(left=df_biofuel_prod, right=df_GDP[['rgdpe']], left_on='Value', right_on='country', how='right')
combined.to_csv('../../combined_test.csv')
the results of this code gives me just the rgdpe column while the other column are empty. What would be the most efficient way to merge and match these dataframes ?
CodePudding user response:
First, from the data screen cap, it looks like the "country" column in your first dataset "df_GDP" is set as index. Reset it using "reset_index()". Then merge on multiple columns like left_on=["countries","year"]
and right_on=["country","year"]
. And since you want to retain all records from your main dataframe "df_biofuel_prod", so it should be "left" join:
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")
Full example with dummy data:
df_GDP = pd.DataFrame(data=[["USA",2001,400],["USA",2002,450],["CAN",2001,150],["CAN",2002,170]], columns=["country","year","rgdpe"]).set_index("country")
df_biofuel_prod = pd.DataFrame(data=[["USA",400,2001],["USA",450,2003],["CAN",150,2001],["CAN",170,2003]], columns=["countries","Value","year"])
combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")
[Out]:
countries Value year country rgdpe
0 USA 400 2001 USA 400.0
1 USA 450 2003 NaN NaN
2 CAN 150 2001 CAN 150.0
3 CAN 170 2003 NaN NaN
You see "NaN" where matching data is not available in "df_GDP".