Home > Software engineering >  Join to dataframes based on index where the second dataframe has repeated indexes related to the fir
Join to dataframes based on index where the second dataframe has repeated indexes related to the fir

Time:07-21

I have two data frames where first dataframe has index starting from zero. The second dataframe has repeated indexes starting from zero. I want to to join the two dataframes based on their indexes. First dataframe is like this

      Start_Year    End_Year
0      1500      1500
1      1500      1501
2      1500      1700
3      1500      1800
4      1500      1800
... ... ...
3409    2018    2018
3410    2018    2018
3411    2019    2019
3412    2019    2022
3413    2020    2020
3414 rows × 2 columns

The second dataframe is

0                       [KingdomofPoland, Georgia]
0                 [GrandDuchyofLithuania, Georgia]
1                   [NorthernYuanDynasty, Georgia]
2                 [SpanishEmpire, ChechenRepublic]
2       [CaptaincyGeneralofChile, ChechenRepublic]
                           ...                    
3411             [SyrianOpposition, SpanishEmpire]
3412                 [UnitedStates, SpanishEmpire]
3412                [UnitedKingdom, SpanishEmpire]
3412                  [SaudiArabia, SpanishEmpire]
3413                              [Turkey, Russia]
Length: 31170, dtype: object

I want to join these two dataframes based on index i.e the new dataframe should look like

      Start_Year    End_Year        new_col
0      1500         1500        [KingdomofPoland, Georgia]
0      1500         1500        [GrandDuchyofLithuania, Georgia]
1      1500         1501        [NorthernYuanDynasty, Georgia]
2      1500         1700        [SpanishEmpire, ChechenRepublic]
2      1500         1700        [CaptaincyGeneralofChile, ChechenRepublic]
......
3411    2019        2019        [SyrianOpposition, SpanishEmpire]
3412    2019        2022        [UnitedStates, SpanishEmpire]
3412    2019        2022        [UnitedKingdom, SpanishEmpire]
3412    2019        2022        [SaudiArabia, SpanishEmpire]
.......

What this essentially is I need to replicate the rows of dataframe 1 based on how many times the same index is repeated in the second dataframe. As we can see, in second dataframe, zero index appears twice, so we replicate the rows of zero index of dataframe 1 twice and then join the dataframes and so on. In the end we can reset the index(that I know about).

I am attaching the links of both the dataframes for the reference. Link for first dataframe https://drive.google.com/file/d/1DqxhnMM8R21Olm9zeRJeDgua_ozoRp8P/view?usp=sharing

Link for second dataframe https://drive.google.com/file/d/1sX5xcTeovVqXtZgSZ5cTC5JRdUvaw7gd/view?usp=sharing

I cant figure out how to proceed with such tasks. Please help me out.

CodePudding user response:

Pandas join would do just what you request.

Consider an example

df1 = pd.DataFrame({'a': [0, 1], 'b': [1500, 1501]})
df2 = pd.DataFrame({'a': [0, 0, 1, 1], 'c': ['a', 'b', 'c', 'd']})
df1.set_index('a', inplace=True)
df2.set_index('a', inplace=True)

We have two dataframes:

   a     b
0  0  1500
1  1  1501

and

   a  c
0  0  a
1  0  b
2  1  c
3  1  d

Now

df = df2.join(df1)[['b', 'c']]

Would be the dataframe

      b  c
a         
0  1500  a
0  1500  b
1  1501  c
1  1501  d
  • Related