Home > Software engineering >  Pandas, if two values are in two columns, but it doesn't matter which value is in which column
Pandas, if two values are in two columns, but it doesn't matter which value is in which column

Time:12-15

I would like to check if two columns in my pandas dataframe contain two values, but it doesn't matter which column the values are in.

I came up with this

df[

((
    df['Entity1'] == 'A'
) 
|
(
    df['Entity2'] == 'A'
))
&
((
    df['Entity1'] == 'B'
) 
|
(
    df['Entity2'] == 'B'
))

]

Whichs works, but I am not sure if it's the most efficient since it's checking each column twice.

CodePudding user response:

One way by comparing swapped columns:

v = df.values
df[~(v == v[:, ::-1]).any(axis=1)]

Or even better if you simply want to drop rows having same element:

df[df["Entity1"].ne(df["Entity2"])]

Validation:

df = pd.DataFrame(np.random.choice([*"AB"], (10000, 2)), columns=["Entity1", "Entity2"])

# OP's method
c1 = df["Entity1"].eq("A")
c2 = df["Entity2"].eq("A")
c3 = df["Entity1"].eq("B")
c4 = df["Entity2"].eq("B")

new_df = df[((c1 | c2) & (c3 | c4))]

# This answer
v = df.values
new_df2 = df[~(v == v[:, ::-1]).any(axis=1)]

new_df.equals(new_df2)
# True

# This answer2
new_df3 = df[df["Entity1"].ne(df["Entity2"])]

new_df.equals(new_df3)
# True

Benchmark (up to 6x faster)

%%timeit
c1 = df["Entity1"].eq("A")
c2 = df["Entity2"].eq("A")
c3 = df["Entity1"].eq("B")
c4 = df["Entity2"].eq("B")

new_df = df[((c1 | c2) & (c3 | c4))]
# 1.86 ms ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit 
v = df.values
new_df2 = df[~(v == v[:, ::-1]).any(axis=1)]
# 447 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[df["Entity1"].ne(df["Entity2"])]
# 272 µs ± 1.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • Related