I have many dataframes. I need to compare the columns across all of them. Ideally I want to return the positions of those columns that do not match as well as the name of columns from df1 and df2 being compared.
Note: I only need to compare the columns, not the data.
Example:
df1 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
df2 = pd.DataFrame(columns=['Nombre', 'Age', 'Address', 'Telefono'])
df3 = pd.DataFrame(columns=['N.', 'A.', 'Address', 'Telephone', 'Email'])
df4 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
Expected output:
DataFrame_A | DataFrame_B | Positions | Col_df_A | Col_df_B |
---|---|---|---|---|
df1 | df2 | 0,3 | ['Name', 'Telephone'] | ['Nombre', 'Telefono'] |
df1 | df3 | 0,1,4 | ['Name', 'Age'] | ['N.', 'A.', 'Email' |
What is best way to do this?
from itertools import combinations
dfs_dict = {"df1": df1, "df2": df2, "df3": df3, "df4": df4}
dfs_names = dfs_dict.keys().tolist()
for df1, df2 in combinations(dfs_names, 2):
...
CodePudding user response:
pandas.DataFrame
instances have .columns
attribute which you can use following way
import pandas as pd
df1 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
df2 = pd.DataFrame(columns=['Nombre', 'Age', 'Address', 'Telefono'])
df3 = pd.DataFrame(columns=['N.', 'A.', 'Address', 'Telephone', 'Email'])
df4 = pd.DataFrame(columns=['Name', 'Age', 'Address', 'Telephone'])
print(df1.columns != df2.columns) # [ True False False True]
print(df1.columns != df4.columns) # [False False False False]
comparison gives True
where names are different else False
, result is numpy.array
so you can easily find position of True
s as follows
import numpy as np
print(np.argwhere(df1.columns != df2.columns).ravel()) # [0 3]
print(np.argwhere(df1.columns != df4.columns).ravel()) # []
I use .ravel()
to get flat structure, as we do not need to care about multiple dimensions in this case. Disclaimer: proposed solution is limited to pandas.DataFrame
s where number of columns is exactly equal.
CodePudding user response:
dfs_dict = {"df1": df1.columns, "df2": df2.columns, "df3": df3.columns, "df4": df4.columns}
rows = []
for df_name, columns_1 in dfs_dict.items():
for df_name2, columns_2 in dfs_dict.items():
if df_name != df_name2:
differents_A = columns_1.difference(columns_2)
differents_B = columns_2.difference(columns_1)
positions = [columns_1.get_loc(column) for column in differents_A]
rows.append({"Dataframe_A": df_name,
"Dataframe_B": df_name2,
"Positions": positions,
"Col_df_A": list(differents_A),
"Col_df_B": list(differents_B)})
df_diff = pd.DataFrame(rows)
print(df_diff)
Output:
Dataframe_A Dataframe_B Positions Col_df_A Col_df_B
0 df1 df2 [0, 3] [Name, Telephone] [Nombre, Telefono]
1 df1 df3 [1, 0] [Age, Name] [A., Email, N.]
2 df1 df4 [] [] []
3 df2 df1 [0, 3] [Nombre, Telefono] [Name, Telephone]
4 df2 df3 [1, 0, 3] [Age, Nombre, Telefono] [A., Email, N., Telephone]
5 df2 df4 [0, 3] [Nombre, Telefono] [Name, Telephone]
6 df3 df1 [1, 4, 0] [A., Email, N.] [Age, Name]
7 df3 df2 [1, 4, 0, 3][A., Email, N., Telephone] [Age, Nombre, Telefono]
8 df3 df4 [1, 4, 0] [A., Email, N.] [Age, Name]
9 df4 df1 [] [] []
10 df4 df2 [0, 3] [Name, Telephone] [Nombre, Telefono]
11 df4 df3 [1, 0] [Age, Name] [A., Email, N.]