Home > database >  How to Merge two datasets with different indexes but one common ID factor?
How to Merge two datasets with different indexes but one common ID factor?

Time:05-07

I am working with two distinct datasets: one regarding COVID-19 statistics and one with demographic characteristics of a city.

The covid19 one, namely covid.df looks as follows:

Note: Date, City ID, City, and State are all indexes

Date City ID City State Population mean Population_2019 mean Confirmed_rate_100k mean Confirmed_rate_100k std death_rate mean death_rate std new_confirmed new_deaths
2020-02 120385 Los Angeles CA 9559699 45959669 0.653 0.556 0.6 0.01 33 5
2020-02 120054 Houtson Texas 3304040 3343560 0.543 0.043 22.34 1.6 60 9
... ... .... ... ... ... ... ... ... ... ... ...
2022-05 120385 Los Angeles CA 9559483 45966549 0.672 0.032 2.3 0.5 22 12

The one with demographic information, demo.df includes the following

City ID HDI Education Mobility Poverty
120385 0.54 72.5 55.522 33.21
120054 0.33 66.2 76.433 12.504

I would like to include the information from demo.df on covid.df, however, given the different indexes for the two datasets, the concat() function has been giving me a hard time.

How can I merge the two such datasets such that covid.df looks like this:

Date City ID City State HDI Education Mobility Poverty Population mean Population_2019 mean Confirmed_rate_100k mean Confirmed_rate_100k std death_rate mean death_rate std new_confirmed new_deaths
2020-02 120385 Los Angeles CA 0.54 72.5 55.522 33.21 9559699 45959669 0.653 0.556 0.6 0.01 33 5
2020-02 120054 Houston TX 0.33 66.2 76.433 12.504 3304040 3343560 0.543 0.043 22.34 1.6 60 9
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2022-05 120385 Los Angeles CA 0.54 72.5 55.522 33.21 9559483 45966549 0.672 0.032 2.3 0.5 22 12

Thank you!

CodePudding user response:

You only need this:

covid = covid.merge(demo, how='left', on='City ID')

For example, suppose we have this input (notice the different indexes of 88, 99 and 'fish', 'fowl'):

covid.df:
       Date  City ID          City    State  Population mean  Population_2019 mean  Confirmed_rate_100k mean  Confirmed_rate_100k std  death_rate mean  death_rate std  new_confirmed  new_deaths
88  2020-02   120385  Los Angeles       CA           9559699              45959669                     0.653                    0.556             0.60            0.01             33           5
99  2020-02   120054      Houtson    Texas           3304040               3343560                     0.543                    0.043            22.34            1.60             60           9
demo.df:
      City ID   HDI  Education  Mobility  Poverty
fish   120385  0.54       72.5    55.522   33.210
fowl   120054  0.33       66.2    76.433   12.50

The output will be

      Date  City ID          City    State  Population mean  Population_2019 mean  Confirmed_rate_100k mean  ...  death_rate std  new_confirmed  new_deaths   HDI  Education  Mobility  Poverty
0  2020-02   120385  Los Angeles       CA           9559699              45959669                     0.653  ...            0.01             33           5  0.54       72.5    55.522   33.210
1  2020-02   120054      Houtson    Texas           3304040               3343560                     0.543  ...            1.60             60           9  0.33       66.2    76.433   12.504

[2 rows x 16 columns]
  • Related