How to find duplicated rows in Pandas with a "wildcard" string value?


I'm trying to find an efficient way to detect duplicated rows when one value stands for a whole set of values:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["one", "two", "two", "two", "one"],
        "condition1": ["all", "red", "all", "red", "red"],
        "condition2": ["yellow", "black", "black", "orange", "black"],
    }
)

In this case, 'all' represents every color present in that column, so the ID 'two' is duplicated: we have [two, red, black] and [two, all, black], where 'all' can be black. But the ID 'one' is not duplicated, because the second condition is not the same color.
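
To state the rule precisely: two rows are duplicates when their IDs are equal and, in every condition column, the two values are either equal or one of them is 'all'. Here is a minimal sketch of that pairwise check (the helper names are mine, purely illustrative):

# Illustrative helpers, not from the original question:
# 'all' acts as a wildcard that matches any concrete value.
def values_match(a, b):
    return a == b or "all" in (a, b)

def rows_match(row_a, row_b, cols=("condition1", "condition2")):
    return row_a["ID"] == row_b["ID"] and all(
        values_match(row_a[c], row_b[c]) for c in cols
    )

Under this rule, ('two', 'red', 'black') and ('two', 'all', 'black') match, while ('one', 'all', 'yellow') and ('one', 'red', 'black') do not, since 'yellow' and 'black' differ.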

The best way I've figured out is to explode the columns and call duplicated(); the problem is that this creates an extremely large dataframe in realistic use cases.

# Replace each 'all' with the list of every other value in its column,
# so that explode() can expand it into one row per concrete value
colors1 = list(set(df.condition1) - {'all'})
colors2 = list(set(df.condition2) - {'all'})
df['condition1'] = df['condition1'].apply(lambda v: colors1 if v == 'all' else v)
df['condition2'] = df['condition2'].apply(lambda v: colors2 if v == 'all' else v)
df = df.explode('condition1')
df = df.explode('condition2')
df.duplicated()

Is there any more standard/efficient way to solve this? I was also thinking of a groupby that accepts the keyword 'all' as all possibilities, or maybe iterating over IDs and individually checking where values are 'all'.

Edit: maybe this solution, splitting and iterating per ID, could do the trick well enough, but maybe there are still better ways.
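
For what it's worth, here is one way that per-ID idea could look. This is only a sketch under my reading of the matching rule (the function name is made up): compare rows pairwise within each ID group, treating 'all' as a wildcard, so the frame is never exploded:

from itertools import combinations

def duplicated_wildcard(df, cols=("condition1", "condition2")):
    # Boolean mask over df: True for rows that wildcard-match at least
    # one other row sharing the same ID (like duplicated(keep=False))
    dup_idx = set()
    for _, group in df.groupby("ID"):
        for (i, a), (j, b) in combinations(group.iterrows(), 2):
            if all(a[c] == b[c] or "all" in (a[c], b[c]) for c in cols):
                dup_idx.update((i, j))
    return df.index.isin(list(dup_idx))

This is quadratic in the size of each ID group rather than in the number of distinct column values, and it marks both members of a matching pair, like duplicated(keep=False) rather than the default keep='first'.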

CodePudding user response:

Here is my take on your interesting question.

With the toy dataframe you provided, modified to take into account more use cases:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["one", "two", "two", "one", "one", "two", "one"],
        "condition1": ["all", "red", "all", "blue", "red", "brown", "all"],
        "condition2": ["yellow", "black", "black", "yellow", "all", "orange", "all"],
    }
)

print(df)
# Output
    ID condition1 condition2
0  one        all     yellow
1  two        red      black
2  two        all      black
3  one       blue     yellow
4  one        red        all
5  two      brown     orange
6  one        all        all

Here is another way to do it, which, on my machine, is slower (0.014 seconds on average over 500 executions, versus 0.002 seconds for your approach) but consumes much less memory, since it relies on a generator expression, and it produces more understandable results:

def find_duplicates(df, col, other_col):
    # All concrete (non-wildcard) values seen in each condition column
    all_condition1 = [
        condition for condition in df[col].unique().tolist() if condition != "all"
    ]
    all_condition2 = [
        condition for condition in df[other_col].unique().tolist() if condition != "all"
    ]
    # For every pair of concrete values, substitute them for 'all' and
    # collect the rows that become exact duplicates; the generator keeps
    # only one substituted frame in memory at a time
    duplicates = pd.concat(
        (
            df.assign(temp1=df[col].mask(df[col] == "all", condition1))
            .assign(temp2=df[other_col].mask(df[other_col] == "all", condition2))
            .pipe(
                lambda df_: df_[
                    df_.duplicated(subset=["ID", "temp1", "temp2"], keep=False)
                ].drop(columns=["temp1", "temp2"])
            )
            for condition1 in all_condition1
            for condition2 in all_condition2
        )
    )
    # A row can be flagged under several substitutions: keep each one once
    return duplicates[~duplicates.index.duplicated(keep="first")].sort_index()

duplicates = find_duplicates(df, "condition1", "condition2")

print(duplicates)
# Output
    ID condition1 condition2
0  one        all     yellow
1  two        red      black
2  two        all      black
3  one       blue     yellow
4  one        red        all
6  one        all        all
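
If you want to reproduce the timing comparison on your own machine, a minimal sketch with timeit might look like this (the numbers will of course differ):

import timeit

# Average runtime over 500 executions, matching the figures quoted above
avg = timeit.timeit(
    lambda: find_duplicates(df, "condition1", "condition2"), number=500
) / 500
print(f"{avg:.4f} s per call")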