Home > Enterprise >  Merging multiple dataframes with overlapping rows and different columns
Merging multiple dataframes with overlapping rows and different columns

Time:10-16

I have multiple pandas data frames with some common columns and some overlapping rows. I would like to combine them in such a way that I have one final data frame with all of the columns and all of the unique rows (overlapping/duplicate rows dropped). The remaining gaps should be nans.

enter image description here

I have come up with the function below. In essence it goes through all columns one by one, appending all of the values from each data frame, dropping the duplicates (overlap), and building a new output data frame column by column.

def combine_dfs(dataframes:list):
    
    ## Identifying all unique columns in all data frames
    columns = []
    for df in dataframes:
        columns.extend(df.columns)
    columns = np.unique(columns)
    
    ## Appending values from each data frame per column
    output_df = pd.DataFrame()
    for col in columns:
        column = pd.Series(dtype="object", name=col)
        for df in dataframes:
            if col in df.columns:
                column = column.append(df[col])
        
        ## Removing overlapping data (assuming consistent values)
        column = column[~column.index.duplicated()]
        
        ## Adding column to output data frame
        column = pd.DataFrame(column)
        output_df = pd.concat([output_df,column], axis=1)
    
    output_df.sort_index(inplace=True)
    return output_df

df_1 = pd.DataFrame([[10,20,30],[11,21,31],[12,22,32],[13,23,33]], columns=["A","B","C"])
df_2 = pd.DataFrame([[33,43,54],[34,44,54],[35,45,55],[36,46,56]], columns=["C","D","E"], index=[3,4,5,6])
df_3 = pd.DataFrame([[50,60],[51,61],[52,62],[53,63],[54,64]], columns=["E","F"])

print(combine_dfs([df_1,df_2,df_3]))

The output, as intended in the visualization, looks like this:

      A     B   C     D   E     F
0  10.0  20.0  30   NaN  50  60.0
1  11.0  21.0  31   NaN  51  61.0
2  12.0  22.0  32   NaN  52  62.0
3  13.0  23.0  33  43.0  54  63.0
4   NaN   NaN  34  44.0  54  64.0
5   NaN   NaN  35  45.0  55   NaN
6   NaN   NaN  36  46.0  56   NaN

This method works well on small data sets. Is there a way to optimize this?

CodePudding user response:

IIUC you can chain combine_first:

print (df_1.combine_first(df_2).combine_first(df_3))

      A     B   C     D     E     F
0  10.0  20.0  30   NaN  50.0  60.0
1  11.0  21.0  31   NaN  51.0  61.0
2  12.0  22.0  32   NaN  52.0  62.0
3  13.0  23.0  33  43.0  54.0  63.0
4   NaN   NaN  34  44.0  54.0  64.0
5   NaN   NaN  35  45.0  55.0   NaN
6   NaN   NaN  36  46.0  56.0   NaN
  • Related