Home > OS >  Combining two Dataframes with same column names, but different MultiIndex
Combining two Dataframes with same column names, but different MultiIndex

Time:10-18

I have two different DataFrames that I would like to combine and I cannot get the answer I am looking for. A MWE is given below.

import numpy as np
import pandas as pd

index_tuple1 = [('car', 'green'), ('car', 'red'), ('boat', 'green')]
index1 = pd.MultiIndex.from_tuples(index_tuple1, names=['Vehicle', 'Color'])
data1 = np.array([[1,2], [3,4], [7,8]])
df1 = pd.DataFrame(data1, index=index1, columns=['2022-05-09', '2022-10-17'])

index_tuple2 = [('car', 'blue'), ('boat', 'blue')]
index2 = pd.MultiIndex.from_tuples(index_tuple2, names=['Vehicle', 'Color'])
data2 = np.array([[5,6], [9,0]])
df2 = pd.DataFrame(data2, index=index2, columns=['2022-05-09', '2022-10-17'])

print(df1)
print(df2)


Output:
               2022-05-09  2022-10-17
Vehicle Color
car     green           1           2
        red             3           4
boat    green           7           8

               2022-05-09  2022-10-17
Vehicle Color
car     blue            5           6
boat    blue            9           0

What I would like to see as output is given below. I have tried concat, merge, and join, but I either get a DataFrame filled with NaNs or "ValueError: columns overlap but no suffix specified: Index(['2022-05-09', '2022-10-17']

               2022-05-09  2022-10-17
Vehicle Color
car     green           1           2
        red             3           4
        blue            5           6
boat    green           7           8
        blue            9           0 

CodePudding user response:

You need same types of columns, e.g. strings or datetimes, then concat working correct:

df1.columns = pd.to_datetime(df1.columns)
df2.columns = pd.to_datetime(df2.columns)

df = pd.concat([df1, df2]).sort_index()
print (df)
               2022-05-09  2022-10-17
Vehicle Color                        
boat    blue            9           0
        green           7           8
car     blue            5           6
        green           1           2
        red             3           4

If ordering is important use here is solution with ordered CategoricalIndex:

df1.columns = pd.to_datetime(df1.columns)
df2.columns = pd.to_datetime(df2.columns)

lvl = df1.index.get_level_values(0).append(df1.index.get_level_values(0)).unique()
print (lvl)
Index(['car', 'boat'], dtype='object', name='Vehicle')

idx1 = pd.CategoricalIndex(df1.index.levels[0], ordered=True, categories=lvl)
idx2 = pd.CategoricalIndex(df2.index.levels[0], ordered=True, categories=lvl)
df1.index = df1.index.set_levels(idx1, level = 0)
df2.index = df2.index.set_levels(idx2, level = 0)


df = pd.concat([df1, df2]).sort_index(level=0, sort_remaining=False)
print (df)
               2022-05-09  2022-10-17
Vehicle Color                        
car     green           1           2
        red             3           4
        blue            5           6
boat    green           7           8
        blue            9           0
  • Related