Home > Back-end >  Compare columns across dataframes Python
Compare columns across dataframes Python

Time:07-25

I have many dataframes. I need to compare the columns across all of them. Ideally I want to return the positions of those columns that do not match as well as the name of columns from df1 and df2 being compared.

Note: I only need to compare the columns, not the data.

Example:


df1 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
df2 = pd.DataFrame(columns=['Nombre', 'Age', 'Address', 'Telefono'])
df3 = pd.DataFrame(columns=['N.', 'A.', 'Address', 'Telephone', 'Email'])
df4 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])

Expected output:

DataFrame_A DataFrame_B Positions Col_df_A Col_df_B
df1 df2 0,3 ['Name', 'Telephone'] ['Nombre', 'Telefono']
df1 df3 0,1,4 ['Name', 'Age'] ['N.', 'A.', 'Email'

What is best way to do this?

from itertools import combinations

dfs_dict = {"df1": df1, "df2": df2, "df3": df3, "df4": df4}

dfs_names = dfs_dict.keys().tolist()

for df1, df2 in combinations(dfs_names, 2):
...

CodePudding user response:

pandas.DataFrame instances have .columns attribute which you can use following way

import pandas as pd
df1 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
df2 = pd.DataFrame(columns=['Nombre', 'Age', 'Address', 'Telefono'])
df3 = pd.DataFrame(columns=['N.', 'A.', 'Address', 'Telephone', 'Email'])
df4 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
print(df1.columns != df2.columns)  # [ True False False  True]
print(df1.columns != df4.columns)  # [False False False False]

comparison gives True where names are different else False, result is numpy.array so you can easily find position of Trues as follows

import numpy as np
print(np.argwhere(df1.columns != df2.columns).ravel())  # [0 3]
print(np.argwhere(df1.columns != df4.columns).ravel())  # []

I use .ravel() to get flat structure, as we do not need to care about multiple dimensions in this case. Disclaimer: proposed solution is limited to pandas.DataFrames where number of columns is exactly equal.

CodePudding user response:

dfs_dict = {"df1": df1.columns, "df2": df2.columns, "df3": df3.columns, "df4": df4.columns}
rows = []

for df_name, columns_1 in dfs_dict.items():
  for df_name2, columns_2 in dfs_dict.items():
    if df_name != df_name2:
      differents_A = columns_1.difference(columns_2)
      differents_B = columns_2.difference(columns_1)
      positions = [columns_1.get_loc(column) for column in differents_A]
      rows.append({"Dataframe_A": df_name,
               "Dataframe_B": df_name2,
               "Positions": positions,
               "Col_df_A": list(differents_A),
               "Col_df_B": list(differents_B)})
  
df_diff = pd.DataFrame(rows)
print(df_diff)

Output:

    Dataframe_A Dataframe_B Positions   Col_df_A                Col_df_B
0           df1         df2 [0, 3]      [Name, Telephone]       [Nombre, Telefono]
1           df1         df3 [1, 0]      [Age, Name]             [A., Email, N.]
2           df1         df4 []          []                      []
3           df2         df1 [0, 3]      [Nombre, Telefono]      [Name, Telephone]
4           df2         df3 [1, 0, 3]   [Age, Nombre, Telefono] [A., Email, N., Telephone]
5           df2         df4 [0, 3]      [Nombre, Telefono]      [Name, Telephone]
6           df3         df1 [1, 4, 0]   [A., Email, N.]         [Age, Name]
7           df3         df2 [1, 4, 0, 3][A., Email, N., Telephone]  [Age, Nombre, Telefono]
8           df3         df4 [1, 4, 0]   [A., Email, N.]         [Age, Name]
9           df4         df1 []          []                      []
10          df4         df2 [0, 3]      [Name, Telephone]       [Nombre, Telefono]
11          df4         df3 [1, 0]      [Age, Name]             [A., Email, N.]
  • Related