Home > Back-end >  Merging two multiindex dataframes
Merging two multiindex dataframes

Time:09-22

I have 2 dataframes:

df1 = pd.DataFrame.from_dict({('category', ''): {0: 'A',
  1: 'B',
  2: 'C',
  3: 'D',
  4: 'E',
  5: 'F',
  6: 'G'},
 (pd.Timestamp('2021-06-28 00:00:00'),
  'metric_1'): {0: 4120.549999999999, 1: 11226.016666666665, 2: 25049.443333333333, 3: 18261.083333333332, 4: 2553.1208333333334, 5: 2843.01, 6: 73203.51333333334},
 (pd.Timestamp('2021-06-28 00:00:00'), 'metric_2'): {0: 9907.79,
  1: 7614.650000000001,
  2: 13775.259999999998,
  3: 13158.250000000004,
  4: 1457.85,
  5: 1089.5600000000002,
  6: 38864.9},
 (pd.Timestamp('2021-07-05 00:00:00'),
  'metric_1'): {0: 5817.319999999998, 1: 10799.45, 2: 23521.51, 3: 22062.350833333334, 4: 1249.5974999999999, 5: 3229.77, 6: 52796.06083333332},
 (pd.Timestamp('2021-07-05 00:00:00'), 'metric_2'): {0: 6321.21,
  1: 5606.01,
  2: 10239.689999999999,
  3: 17476.600000000002,
  4: 943.7199999999999,
  5: 1410.33,
  6: 29645.45}}).set_index('category')
df2 = pd.DataFrame.from_dict({'category': {0: 'A',
  1: 'B',
  2: 'C',
  3: 'D',
  4: 'E',
  5: 'F',
  6: 'G'},
 1: {0: 36234.035577957984,
  1: 69078.07089184562,
  2: 128879.5397517309,
  3: 178376.63536908248,
  4: 9293.956915067887,
  5: 8184.780211399392,
  6: 177480.74540313095},
 2: {0: 37887.581678419825,
  1: 72243.67956241772,
  2: 134803.02342121338,
  3: 186603.8963173654,
  4: 9716.385738295368,
  5: 8555.606693927,
  6: 185658.87577993725}}).set_index('category')

First I change the column names of df2 to be the same as df

date_mappings = {
1 : '2021-06-28',
2 : '2021-07-05'}

df2 = df2.rename(columns=date_mappings)

Then I try to merge it

f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack(), left_index=True, right_index=True).sort_index(axis=1))

But I get an error:

ValueError: Cannot merge a Series without a name

What is my mistake?

My goal is to add columns from df2 to df1 in each week like so that df1 would have 3 columns instead of 2.

enter image description here

After using

c = [df2.columns.map(date_mappings.get), df2.columns]
df1.join(df2.set_axis(c, axis=1)).sort_index(axis=1)

I get the values appended to the end of the dataframe rather than to the same columns with the same week naming:

enter image description here

Maybe this could be an issue that df2 holds dates from 2021-06-28 to 2022-06-27 while df1 holds dates from 2020 to today.

Unwanted adding to the end of the df enter image description here

CodePudding user response:

Idea is create MultiIndex in both DataFrames:

date_mappings = {
1 : '2021-06-28',
2 : '2021-07-05'}

#create MultiIndex in df2 with datetimes in first level
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)), 
                                          ['metric_3']])

#removed unused levels, here category, so possible convert first leve to datetimes
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)

#join together and sorting MultiIndex
df = df1.join(df2).sort_index(axis=1)

print (df)
            2021-06-28                             2021-07-05            \
              metric_1  metric_2       metric_3      metric_1  metric_2   
category                                                                  
A          4120.550000   9907.79   36234.035578   5817.320000   6321.21   
B         11226.016667   7614.65   69078.070892  10799.450000   5606.01   
C         25049.443333  13775.26  128879.539752  23521.510000  10239.69   
D         18261.083333  13158.25  178376.635369  22062.350833  17476.60   
E          2553.120833   1457.85    9293.956915   1249.597500    943.72   
F          2843.010000   1089.56    8184.780211   3229.770000   1410.33   
G         73203.513333  38864.90  177480.745403  52796.060833  29645.45   

                         
               metric_3  
category                 
A          37887.581678  
B          72243.679562  
C         134803.023421  
D         186603.896317  
E           9716.385738  
F           8555.606694  
G         185658.875780  

If need remove greater datetimes like maximal df1 datetimes use:

#change mapping for test
date_mappings = {
1 : '2021-06-28',
2 : '2022-07-05'}

df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)), 
                                          ['metric_3']])

df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)


df2 = df2.loc[:, df2.columns.get_level_values(0) <= df1.columns.get_level_values(0).max()]
print (df2)
             2021-06-28
               metric_3
category               
A          36234.035578
B          69078.070892
C         128879.539752
D         178376.635369
E           9293.956915
F           8184.780211
G         177480.745403

#join together and sorting MultiIndex
df = df1.join(df2).sort_index(axis=1)

print (df)
            2021-06-28                             2021-07-05          
              metric_1  metric_2       metric_3      metric_1  metric_2
category                                                               
A          4120.550000   9907.79   36234.035578   5817.320000   6321.21
B         11226.016667   7614.65   69078.070892  10799.450000   5606.01
C         25049.443333  13775.26  128879.539752  23521.510000  10239.69
D         18261.083333  13158.25  178376.635369  22062.350833  17476.60
E          2553.120833   1457.85    9293.956915   1249.597500    943.72
F          2843.010000   1089.56    8184.780211   3229.770000   1410.33
G         73203.513333  38864.90  177480.745403  52796.060833  29645.45

CodePudding user response:

Use pd.DataFrame.reindex pd.DataFrame.join reindex has a convenient level parameter that allows you to expand on the index levels not present.

df1.join(df2.reindex(df1.index, level=0))

CodePudding user response:

I am not sure if this is what you want but you might need to_frame:

f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack().to_frame(), left_index=True, right_index=True).sort_index(level=0))
print(df)
  • Related