Home > Enterprise >  detect overlapping values in 2 dataframes
detect overlapping values in 2 dataframes

Time:09-15

df1 = pd.DataFrame({"fields": [["boy", "apple", "toy", "orange", "bear", "eat"], 
                              ["orange", "girl", "red"]]})
df2 = pd.DataFrame({"other fields": [["boy", "girl", "orange"]}) 

and I want to add a column to df1 indicating that the fields overlap with other fields, sample output:

|fields| overlap?|
|------|---------|
|boy   |Y
|apple |N
|toy   |N
|orange|Y
|bear  |N
|eat   |N
|orange|N
|girl  |Y
|red   |N

first I will explode fields on df1, but I am not sure what the next steps are to check overlap values between 2 dataframes. Thanks!

CodePudding user response:

You can also do it without apply. As you said you can explode, then using isin you can check whether values exist in df2 which will return True / False and then mapping 'Y' / 'N' on that:

df1_exp = df1.explode('fields')
df1_exp['overlap'] = df1_exp['fields'].isin(df2['other fields']).map({True:'Y',False:'N'})


   fields overlap
0     boy       Y
0   apple       N
0     toy       N
0  orange       Y
0    bear       N
0     eat       N
1  orange       Y
1    girl       Y
1     red       N

CodePudding user response:

this should work

df1 = df1.explode("fields")
df1["overlap"] = df1["fields"].apply(lambda x: "Y" if x in df2["other fields"].values else "N")
    fields  overlap
0   boy     Y
0   apple   N
0   toy     N
0   orange  Y
0   bear    N
0   eat     N
1   orange  Y
1   girl    Y
1   red     N

CodePudding user response:

You can use isin to find overlapping values after exploding both dfs and change the bool to Y/N using np.where

df1 = pd.DataFrame({"fields": [["boy", "apple", "toy", "orange", "bear", "eat"], ["orange", "girl", "red"]]})
df2 = pd.DataFrame({"other fields": [["boy", "girl", "orange"]]})

df1 = df1.explode('fields', ignore_index=True)
df1['overlap'] = np.where(df1['fields'].isin(df2['other fields'].explode()), 'Y', 'N')
print(df1)

Output

   fields overlap
0     boy       Y
1   apple       N
2     toy       N
3  orange       Y
4    bear       N
5     eat       N
6  orange       Y
7    girl       Y
8     red       N

CodePudding user response:

You can try .isin():

df1 = df1.explode("fields")
df1["overlap"] = df1["fields"].isin(df2["other fields"][0])

You can later replace the True/False with Y/N

CodePudding user response:

Another way is using np.select. I normally use it for huge data sets where some methods may take a while to be executed:

df1 = df1.explode(column='fields')
df1['overlap'] = np.select([df1.fields.isin(df2['other fields'])], ['Y'], 'N')

   index  fields overlap
0      0     boy       Y
1      0   apple       N
2      0     toy       N
3      0  orange       Y
4      0    bear       N
5      0     eat       N
6      1  orange       Y
7      1    girl       Y
8      1     red       N
  • Related