Assert equality of subsets of pandas dataframes

Time:06-02

I have a list of pandas dataframes. I want to ensure that every pair of dataframes agrees on the rows and columns they have in common. Example dataframes:

import pandas as pd

df1 = pd.DataFrame({"ix": [1, 2, 3], "1": [3, 4, 5]                }).set_index("ix")
df2 = pd.DataFrame({"ix": [1, 2   ], "1": [3, 4   ], "2": [3, 4   ]}).set_index("ix")
df3 = pd.DataFrame({"ix": [   2, 3], "1": [   4, 5], "2": [   4, 6]}).set_index("ix")
df4 = pd.DataFrame({"ix": [      3],                 "2": [      6]}).set_index("ix")
dataframes = [df1, df2, df3, df4]

These dataframes fulfil my requirement. I wrote the following code to check that:

from pandas.testing import assert_frame_equal

kwargs = {"check_dtype": False, "check_like": True}

for i, left in enumerate(dataframes):
    for right in dataframes[i + 1:]:
        cl = left.columns.intersection(right.columns)
        ix = left.index.intersection(right.index)
        assert_frame_equal(left.loc[ix, cl], right.loc[ix, cl], **kwargs)

I suspect that the performance will be very bad for long lists and huge dataframes, because every pair of dataframes is compared.

My question: Is that really the best way to do that?

CodePudding user response:

Apart from using itertools.combinations (which is mostly syntactic sugar), I don't know how you can improve your code:

from itertools import combinations

for left, right in combinations(dataframes, 2):
    cl = left.columns.intersection(right.columns)
    ix = left.index.intersection(right.index)
    assert_frame_equal(left.loc[ix, cl], right.loc[ix, cl], check_like=True)
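
As a rough sanity check, the loop above can be wrapped in a function and run against a deliberately conflicting dataframe. This is only a sketch: the helper name `assert_overlaps_equal` and the small `good`, `also_good` and `bad` frames are made up for illustration, and `check_dtype=False` is carried over from the question.

```python
from itertools import combinations

import pandas as pd
from pandas.testing import assert_frame_equal


def assert_overlaps_equal(frames):
    """Hypothetical helper: compare every pair on its shared rows and columns."""
    for left, right in combinations(frames, 2):
        cl = left.columns.intersection(right.columns)
        ix = left.index.intersection(right.index)
        assert_frame_equal(
            left.loc[ix, cl], right.loc[ix, cl],
            check_dtype=False, check_like=True,
        )


good = pd.DataFrame({"ix": [1, 2], "1": [3, 4]}).set_index("ix")
also_good = pd.DataFrame({"ix": [2, 3], "1": [4, 5]}).set_index("ix")
bad = pd.DataFrame({"ix": [2], "1": [99]}).set_index("ix")  # conflicts at ix=2

# Consistent frames pass silently.
assert_overlaps_equal([good, also_good])

# A conflicting value in a shared cell raises AssertionError.
try:
    assert_overlaps_equal([good, also_good, bad])
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```

The number of comparisons still grows quadratically with the length of the list, so this only tidies up the loop; it does not address the performance concern.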

CodePudding user response:

What really bothers me is the nested for loop. I now have a solution that needs only one for loop.

kwargs = {"check_dtype": False, "check_like": True}

# For every index value, take the first non-null entry per column across all dataframes
basis = pd.concat(dataframes).groupby("ix").first()
for df in dataframes:
    assert_frame_equal(basis.loc[df.index, df.columns], df, **kwargs)
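
Here is a self-contained sketch of the same idea, in case it helps to see it catch a conflict. The function name `assert_consistent`, its `index_name` parameter and the sample frames are hypothetical. It assumes every dataframe has an index named "ix", as in the question.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_consistent(frames, index_name="ix"):
    """Hypothetical helper: compare each frame against one combined basis."""
    # groupby(...).first() keeps, per index value and column,
    # the first non-null entry found in concatenation order.
    basis = pd.concat(frames).groupby(index_name).first()
    for df in frames:
        assert_frame_equal(
            basis.loc[df.index, df.columns], df,
            check_dtype=False, check_like=True,
        )


good = pd.DataFrame({"ix": [1, 2], "1": [3, 4]}).set_index("ix")
wider = pd.DataFrame({"ix": [2, 3], "1": [4, 5], "2": [7, 8]}).set_index("ix")
bad = pd.DataFrame({"ix": [2], "1": [99]}).set_index("ix")  # conflicts at ix=2

# Frames that agree on every shared cell pass silently.
assert_consistent([good, wider])

# The conflicting frame differs from the basis, so the check raises.
try:
    assert_consistent([good, wider, bad])
    conflict_detected = False
except AssertionError:
    conflict_detected = True
```

Each dataframe is compared once against the basis, so the number of comparisons grows linearly with the list. One caveat: because the basis takes the first non-null value it sees, the dataframe that raises is the one that disagrees with the earliest entry, which is not necessarily the one holding the wrong data.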