Home > other >  Merging 2 datasets in Python
Merging 2 datasets in Python

Time:12-18

I have 2 diffferent datasets and I want to merge these 2 datasets based on column "country" with the common country names and dropping the ones different. I have done it with inner merge, but the dataset is not as I want to have.

inner_merged = pd.merge(TFC_DATA,CO2_DATA,how="inner",on="country")

TFC_DATA (in the orginal dataset there exits a column called year but I've dropped it):

| Country | TFP |

| Angola | 0.8633379340171814 |

| Angola | 0.9345720410346984 |

| Angola | 1.0301895141601562 |

| Angola | 1.0850582122802734 |

.
.
.

CO2_DATA: | Country | year | GDP | co2

| Afghanistan | 2005 | 25397688320.0 | 1

| Afghanistan | 2006 | 28704401408.0 | 2

| Afghanistan | 2007 | 34507530240.0 | 2

| Afghanistan | 2008 | 1.0850582122802734 | 3

| Afghanistan | 2009 | 1.040212631225586 | 1

. . .

What I want is

Output |Country|Year|gdp|co2|TFP Angola|2005|51967275008.0|19.006|0.8633379340171814 Angola|2006|66748907520.0|19.006|0.9345720410346984 Angola|2007|87085293568.0|19.006|1.0301895141601562 . . .

What I have instead

Output Country|Year|gdp|co2|Year|TFP Angola|2005|51967275008.0|19.006|2005|0.8633379340171814 Angola|2005|51967275008.0|19.006|2006|0.9345720410346984 Angola|2005|51967275008.0|19.006|2007|1.0301895141601562 Angola|2005|51967275008.0|19.006|2008|1.0850582122802734 Angola|2005|51967275008.0|19.006|2009|1.040212631225586 Angola|2005|51967275008.0|19.006|2010|1.0594196319580078 Angola|2005|51967275008.0|19.006|2011|1.036203384399414 Angola|2005|51967275008.0|19.006|2012|1.076979637145996 Angola|2005|51967275008.0|19.006|2013|1.0862818956375122 Angola|2005|51967275008.0|19.006|2014|1.096832513809204 Angola|2005|51967275008.0|19.006|2015|1.0682281255722046 Angola|2005|51967275008.0|19.006|2016|1.0160540342330933 Angola|2005|51967275008.0|19.006|2017|1.0

I expected the datas of the countrys' merge in one dataset but it duplicates itself until the second one data is over then the second one does the same

CodePudding user response:

TFC_DATA (in the orginal dataset there exits a column called year but I've dropped it):

Well, based on your expected output, you should not drop the column Year from the dataframe TFC_DATA. Only then, you can use pandas.merge (as shown below). Because otherwise, you'll have duplicated values.

pd.merge(CO2_DATA, TFC_DATA, left_on=["country", "year"], right_on=["country", "Year"])

OR :

pd.merge(CO2_DATA, TFC_DATA.rename(columns={"Year": "year"}), on=["country", "year"])

CodePudding user response:

pd.merge() function performs an inner join by default that means it only includes rows that have matching values in the specified columns.

Use a different join type one option is to use a left outer join, which will include all rows from the left dataset (TFC_DATA) and only the matching rows from the right dataset (CO2_DATA).

Specify a left outer join using the how="left" parameter in the pd.merge() function.

merged_data = pd.merge(TFC_DATA, CO2_DATA, how="left", on="country")

After @abokey's comment EDIT

First, create a new column in the TFC_DATA dataset with the year value

TFC_DATA["year"] = TFC_DATA.index.year

Group the TFC_DATA dataset by "country" and "year", and compute the mean TFP value for each group

TFC_DATA_agg = TFC_DATA.groupby(["country", "year"]).mean()

Reset the index to make "country" and "year" columns in the resulting dataset

TFC_DATA_agg = TFC_DATA_agg.reset_index()

Perform the inner merge, using "country" and "year" as the merge keys

merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])

  • Related