I'm trying to find an efficient way to detect duplicated rows when a single value can stand for a whole set of values:
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["one", "two", "two", "two", "one"],
        "condition1": ["all", "red", "all", "red", "red"],
        "condition2": ["yellow", "black", "black", "orange", "black"],
    }
)
In this case, 'all' represents every color present in that column, so the ID 'two' is duplicated: we have [two, red, black] and [two, all, black], where "all" can be black. But the ID 'one' is not duplicated, because the colors in the second condition differ.
The best way I have figured out is to explode the columns and call duplicated(); the problem is that this solution produces an extremely large dataframe in realistic use cases.
# Replace the wildcard 'all' with the set of the remaining values so that
# explode() can expand it, then look for duplicates on the expanded frame
df.loc[df.condition1 == 'all', 'condition1'] = set(df.condition1) - {'all'}
df.loc[df.condition2 == 'all', 'condition2'] = set(df.condition2) - {'all'}
df = df.explode('condition1')
df = df.explode('condition2')
df.duplicated()
Is there a more standard/efficient way to solve this? I was also thinking of a groupby that treats the keyword 'all' as every possibility, or of iterating over the IDs and individually checking where values are 'all'.
Edit: maybe a solution that splits and iterates per ID, roughly as sketched below, could do the trick well enough, but maybe there are still better ways.
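To make the per-ID idea concrete, this is roughly what I have in mind (an untested sketch, assuming only these two condition columns and that 'all' is the only wildcard; rows_collide and wildcard_duplicated are just names I made up for the sketch), run on the original df before the explode above:

def rows_collide(a, b):
    # Two rows collide when every condition is equal or one side is the wildcard 'all'
    return all(x == y or 'all' in (x, y) for x, y in zip(a, b))

def wildcard_duplicated(frame, cond_cols=('condition1', 'condition2')):
    # Flag a row as duplicated if any other row with the same ID collides with it
    dup = pd.Series(False, index=frame.index)
    for _, group in frame.groupby('ID'):
        rows = group[list(cond_cols)].to_numpy().tolist()
        idx = list(group.index)
        for i in range(len(idx)):
            for j in range(len(idx)):
                if i != j and rows_collide(rows[i], rows[j]):
                    dup.loc[idx[i]] = True
    return dup

# wildcard_duplicated(df)  # apply to the original df, before the explode above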
CodePudding user response:
Here is my take on your interesting question.
With the toy dataframe you provided, modified to take into account more use cases:
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["one", "two", "two", "one", "one", "two", "one"],
        "condition1": ["all", "red", "all", "blue", "red", "brown", "all"],
        "condition2": ["yellow", "black", "black", "yellow", "all", "orange", "all"],
    }
)
print(df)
# Output
    ID condition1 condition2
0  one        all     yellow
1  two        red      black
2  two        all      black
3  one       blue     yellow
4  one        red        all
5  two      brown     orange
6  one        all        all
Here is another way to do it, which, on my machine, is slower (0.014 seconds on average over 500 executions, versus 0.002 seconds for your approach) but consumes much less memory, since it relies on a generator expression, and produces more readable results:
def find_duplicates(df, col, other_col):
    # Concrete (non-'all') values observed in each condition column
    all_condition1 = [
        condition for condition in df[col].unique().tolist() if condition != "all"
    ]
    all_condition2 = [
        condition for condition in df[other_col].unique().tolist() if condition != "all"
    ]
    # For every combination of concrete values, substitute the wildcard 'all'
    # and collect the rows that become duplicated on (ID, condition1, condition2)
    duplicates = pd.concat(
        (
            df.assign(temp1=df[col].mask(df[col] == "all", condition1))
            .assign(temp2=df[other_col].mask(df[other_col] == "all", condition2))
            .pipe(
                lambda df_: df_[
                    df_.duplicated(subset=["ID", "temp1", "temp2"], keep=False)
                ].drop(columns=["temp1", "temp2"])
            )
            for condition1 in all_condition1
            for condition2 in all_condition2
        )
    )
    # A row can be flagged by several combinations; keep each original row once
    return duplicates[~duplicates.index.duplicated(keep="first")].sort_index()
duplicates = find_duplicates(df, "condition1", "condition2")
print(duplicates)
# Output
    ID condition1 condition2
0  one        all     yellow
1  two        red      black
2  two        all      black
3  one       blue     yellow
4  one        red        all
6  one        all        all
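In case you want to reproduce the comparison, a rough harness along these lines works (illustrative only, not the exact script I timed; explode_approach is a self-contained stand-in for your original method, and the numbers will vary from machine to machine):

import timeit

def explode_approach(frame):
    # Expand 'all' into the concrete values of each column, explode, then flag duplicates
    tmp = frame.copy()
    for col in ("condition1", "condition2"):
        others = list(set(tmp[col]) - {"all"})
        tmp[col] = tmp[col].apply(lambda v, others=others: others if v == "all" else v)
        tmp = tmp.explode(col)
    return tmp.duplicated()

n = 500
t_explode = timeit.timeit(lambda: explode_approach(df), number=n) / n
t_generator = timeit.timeit(lambda: find_duplicates(df, "condition1", "condition2"), number=n) / n
print(f"explode: {t_explode:.4f} s per call, generator: {t_generator:.4f} s per call")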