Home > Mobile >  combining dataframes that have the same 'country name' and same 'year'
combining dataframes that have the same 'country name' and same 'year'

Time:11-21

I m trying to merge these dataframes in a way that the final data frame would have matched the country year gdp from first dataframe with its corresponding values from second data frame.

[dataframe 1: df_GDP] [dataframe 2:df_biofuel_prod]

first data frame :

Country Country code year rgdpe
country1 Code1 year1 rgdpe1
country1 Code1 yearn rgdpen
country2 Code2 year1 rgdpe1'

second dataframe :

countries value year
country1 value1 year1
country1 valuen yearn
country2 Code2 year1

combined dataframe: | Country | Country code | year |rgdpe |value| |:--------|:------------:|:----:|:-----:|:---:| |country1 | Code1 | year1|rgdpe1 |value| |country1 | Code1 | yearn|rgdpen |Value| |country2 | Code2 | year1|rgdpe1'|Value|

    combined=pd.merge(left=df_biofuel_prod, right=df_GDP[['rgdpe']], left_on='Value', right_on='country', how='right')
    combined.to_csv('../../combined_test.csv')

the results of this code gives me just the rgdpe column while the other column are empty. What would be the most efficient way to merge and match these dataframes ?

CodePudding user response:

First, from the data screen cap, it looks like the "country" column in your first dataset "df_GDP" is set as index. Reset it using "reset_index()". Then merge on multiple columns like left_on=["countries","year"] and right_on=["country","year"]. And since you want to retain all records from your main dataframe "df_biofuel_prod", so it should be "left" join:

combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")

Full example with dummy data:

df_GDP = pd.DataFrame(data=[["USA",2001,400],["USA",2002,450],["CAN",2001,150],["CAN",2002,170]], columns=["country","year","rgdpe"]).set_index("country")

df_biofuel_prod = pd.DataFrame(data=[["USA",400,2001],["USA",450,2003],["CAN",150,2001],["CAN",170,2003]], columns=["countries","Value","year"])

combined_df = df_biofuel_prod.merge(df_GDP.reset_index(), left_on=["countries","year"], right_on=["country","year"], how="left")

[Out]:
  countries  Value  year country  rgdpe
0       USA    400  2001     USA  400.0
1       USA    450  2003     NaN    NaN
2       CAN    150  2001     CAN  150.0
3       CAN    170  2003     NaN    NaN

You see "NaN" where matching data is not available in "df_GDP".

  • Related