I have a list of pandas dataframes. I want to ensure pairwise equality for the rows and columns they have in common. Example dataframes:
import pandas as pd
df1 = pd.DataFrame({"ix": [1, 2, 3], "1": [3, 4, 5] }).set_index("ix")
df2 = pd.DataFrame({"ix": [1, 2 ], "1": [3, 4 ], "2": [3, 4 ]}).set_index("ix")
df3 = pd.DataFrame({"ix": [ 2, 3], "1": [ 4, 5], "2": [ 4, 6]}).set_index("ix")
df4 = pd.DataFrame({"ix": [ 3], "2": [ 6]}).set_index("ix")
dataframes = [df1, df2, df3, df4]
These example dataframes fulfil my requirement. I wrote the following code to check that:
from pandas.testing import assert_frame_equal
kwargs = {"check_dtype": False, "check_like": True}
for i, left in enumerate(dataframes):
    # Compare each dataframe only with the ones after it in the list
    for right in dataframes[i + 1:]:
        cl = left.columns.intersection(right.columns)
        ix = left.index.intersection(right.index)
        assert_frame_equal(left.loc[ix, cl], right.loc[ix, cl], **kwargs)
I suspect the performance will be very bad for long lists and huge dataframes, since the number of comparisons grows quadratically with the length of the list.
My question: Is that really the best way to do that?
CodePudding user response:
Other than using itertools.combinations (which is mostly syntactic sugar), I don't see how you can improve your code:
from itertools import combinations
from pandas.testing import assert_frame_equal
for left, right in combinations(dataframes, 2):
    cl = left.columns.intersection(right.columns)
    ix = left.index.intersection(right.index)
    assert_frame_equal(left.loc[ix, cl], right.loc[ix, cl], check_like=True)
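If it helps to know which pairs disagree rather than stopping at the first failure, the same pairwise loop can collect the offending positions. This is only a sketch: the function name find_mismatched_pairs is made up for illustration, and it keeps check_dtype=False from the question.

```python
from itertools import combinations

import pandas as pd
from pandas.testing import assert_frame_equal


def find_mismatched_pairs(dataframes):
    """Return the list positions (i, j) of dataframes that disagree on shared cells.

    Hypothetical helper, not part of pandas.
    """
    mismatches = []
    for (i, left), (j, right) in combinations(enumerate(dataframes), 2):
        # Restrict both frames to the rows and columns they have in common
        cl = left.columns.intersection(right.columns)
        ix = left.index.intersection(right.index)
        try:
            assert_frame_equal(
                left.loc[ix, cl], right.loc[ix, cl],
                check_dtype=False, check_like=True,
            )
        except AssertionError:
            mismatches.append((i, j))
    return mismatches
```

An empty list means every pair agrees on its overlap. The number of comparisons is still quadratic in the length of the list.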
CodePudding user response:
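What really bothers me is the nested for loop. I now have a solution with only one loop: build a reference frame from the first non-null value for each index label and column, then compare every dataframe against it.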
kwargs = {"check_dtype": False, "check_like": True}
# Reference frame: first non-null value per index label and column
basis = pd.concat(dataframes).groupby("ix").first()
for df in dataframes:
    assert_frame_equal(basis.loc[df.index, df.columns], df, **kwargs)
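Here is a self-contained sketch of that single-pass idea, with a deliberately inconsistent dataframe to show that a mismatch is caught. The function name assert_consistent is hypothetical. It assumes each dataframe has unique index labels and contains no NaN values of its own, because groupby(...).first() skips nulls when it picks the reference value.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_consistent(dataframes):
    """Raise AssertionError if any dataframe disagrees with the reference frame.

    Hypothetical helper; assumes unique index labels and no NaN in the inputs.
    """
    # First non-null value for every (index label, column) across all frames
    basis = pd.concat(dataframes).groupby(level=0).first()
    for df in dataframes:
        assert_frame_equal(
            basis.loc[df.index, df.columns], df,
            check_dtype=False, check_like=True,
        )


df1 = pd.DataFrame({"ix": [1, 2, 3], "1": [3, 4, 5]}).set_index("ix")
df2 = pd.DataFrame({"ix": [1, 2], "1": [3, 4], "2": [3, 4]}).set_index("ix")
df3 = pd.DataFrame({"ix": [2, 3], "1": [4, 5], "2": [4, 6]}).set_index("ix")
df4 = pd.DataFrame({"ix": [3], "2": [6]}).set_index("ix")

assert_consistent([df1, df2, df3, df4])  # passes silently

# Same shape as df4, but the shared cell (ix=3, column "2") differs from df3
df4_bad = pd.DataFrame({"ix": [3], "2": [7]}).set_index("ix")
try:
    assert_consistent([df1, df2, df3, df4_bad])
    caught = False
except AssertionError:
    caught = True
```

This does one comparison per dataframe instead of one per pair, at the cost of concatenating all the data once. check_dtype=False matters here because the concatenated frame holds NaN for missing cells, which turns integer columns into floats.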