I would like to check if two columns in my pandas dataframe contain two values, but it doesn't matter which column the values are in.
I came up with this
df[
((
df['Entity1'] == 'A'
)
|
(
df['Entity2'] == 'A'
))
&
((
df['Entity1'] == 'B'
)
|
(
df['Entity2'] == 'B'
))
]
Whichs works, but I am not sure if it's the most efficient since it's checking each column twice.
CodePudding user response:
One way by comparing swapped columns:
v = df.values
df[~(v == v[:, ::-1]).any(axis=1)]
Or even better if you simply want to drop rows having same element:
df[df["Entity1"].ne(df["Entity2"])]
Validation:
df = pd.DataFrame(np.random.choice([*"AB"], (10000, 2)), columns=["Entity1", "Entity2"])
# OP's method
c1 = df["Entity1"].eq("A")
c2 = df["Entity2"].eq("A")
c3 = df["Entity1"].eq("B")
c4 = df["Entity2"].eq("B")
new_df = df[((c1 | c2) & (c3 | c4))]
# This answer
v = df.values
new_df2 = df[~(v == v[:, ::-1]).any(axis=1)]
new_df.equals(new_df2)
# True
# This answer2
new_df3 = df[df["Entity1"].ne(df["Entity2"])]
new_df.equals(new_df3)
# True
Benchmark (up to 6x faster)
%%timeit
c1 = df["Entity1"].eq("A")
c2 = df["Entity2"].eq("A")
c3 = df["Entity1"].eq("B")
c4 = df["Entity2"].eq("B")
new_df = df[((c1 | c2) & (c3 | c4))]
# 1.86 ms ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
v = df.values
new_df2 = df[~(v == v[:, ::-1]).any(axis=1)]
# 447 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df[df["Entity1"].ne(df["Entity2"])]
# 272 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)