detect overlapping values in 2 dataframes-CodePudding

df1 = pd.DataFrame({"fields": [["boy", "apple", "toy", "orange", "bear", "eat"], 
                              ["orange", "girl", "red"]]})
df2 = pd.DataFrame({"other fields": [["boy", "girl", "orange"]})

and I want to add a column to df1 indicating that the fields overlap with other fields, sample output:

|fields| overlap?|
|------|---------|
|boy   |Y
|apple |N
|toy   |N
|orange|Y
|bear  |N
|eat   |N
|orange|N
|girl  |Y
|red   |N

first I will explode fields on df1, but I am not sure what the next steps are to check overlap values between 2 dataframes. Thanks!

CodePudding user response：

You can also do it without apply. As you said you can explode, then using isin you can check whether values exist in df2 which will return True / False and then mapping 'Y' / 'N' on that:

df1_exp = df1.explode('fields')
df1_exp['overlap'] = df1_exp['fields'].isin(df2['other fields']).map({True:'Y',False:'N'})


   fields overlap
0     boy       Y
0   apple       N
0     toy       N
0  orange       Y
0    bear       N
0     eat       N
1  orange       Y
1    girl       Y
1     red       N

CodePudding user response：

this should work

df1 = df1.explode("fields")
df1["overlap"] = df1["fields"].apply(lambda x: "Y" if x in df2["other fields"].values else "N")

    fields  overlap
0   boy     Y
0   apple   N
0   toy     N
0   orange  Y
0   bear    N
0   eat     N
1   orange  Y
1   girl    Y
1   red     N

CodePudding user response：

You can use isin to find overlapping values after exploding both dfs and change the bool to Y/N using np.where

df1 = pd.DataFrame({"fields": [["boy", "apple", "toy", "orange", "bear", "eat"], ["orange", "girl", "red"]]})
df2 = pd.DataFrame({"other fields": [["boy", "girl", "orange"]]})

df1 = df1.explode('fields', ignore_index=True)
df1['overlap'] = np.where(df1['fields'].isin(df2['other fields'].explode()), 'Y', 'N')
print(df1)

Output

   fields overlap
0     boy       Y
1   apple       N
2     toy       N
3  orange       Y
4    bear       N
5     eat       N
6  orange       Y
7    girl       Y
8     red       N

CodePudding user response：

You can try .isin():

df1 = df1.explode("fields")
df1["overlap"] = df1["fields"].isin(df2["other fields"][0])

You can later replace the True/False with Y/N

CodePudding user response：

Another way is using np.select. I normally use it for huge data sets where some methods may take a while to be executed:

df1 = df1.explode(column='fields')
df1['overlap'] = np.select([df1.fields.isin(df2['other fields'])], ['Y'], 'N')

   index  fields overlap
0      0     boy       Y
1      0   apple       N
2      0     toy       N
3      0  orange       Y
4      0    bear       N
5      0     eat       N
6      1  orange       Y
7      1    girl       Y
8      1     red       N