df1 = pd.DataFrame({"fields": [["boy", "apple", "toy", "orange", "bear", "eat"],
["orange", "girl", "red"]]})
df2 = pd.DataFrame({"other fields": [["boy", "girl", "orange"]})
and I want to add a column to df1 indicating that the fields overlap with other fields, sample output:
|fields| overlap?|
|------|---------|
|boy |Y
|apple |N
|toy |N
|orange|Y
|bear |N
|eat |N
|orange|N
|girl |Y
|red |N
first I will explode fields on df1, but I am not sure what the next steps are to check overlap values between 2 dataframes. Thanks!
CodePudding user response:
You can also do it without apply
. As you said you can explode
, then using isin
you can check whether values exist in df2 which will return True
/ False
and then mapping 'Y' / 'N' on that:
df1_exp = df1.explode('fields')
df1_exp['overlap'] = df1_exp['fields'].isin(df2['other fields']).map({True:'Y',False:'N'})
fields overlap
0 boy Y
0 apple N
0 toy N
0 orange Y
0 bear N
0 eat N
1 orange Y
1 girl Y
1 red N
CodePudding user response:
this should work
df1 = df1.explode("fields")
df1["overlap"] = df1["fields"].apply(lambda x: "Y" if x in df2["other fields"].values else "N")
fields overlap
0 boy Y
0 apple N
0 toy N
0 orange Y
0 bear N
0 eat N
1 orange Y
1 girl Y
1 red N
CodePudding user response:
You can use isin
to find overlapping values after exploding both df
s and change the bool
to Y
/N
using np.where
df1 = pd.DataFrame({"fields": [["boy", "apple", "toy", "orange", "bear", "eat"], ["orange", "girl", "red"]]})
df2 = pd.DataFrame({"other fields": [["boy", "girl", "orange"]]})
df1 = df1.explode('fields', ignore_index=True)
df1['overlap'] = np.where(df1['fields'].isin(df2['other fields'].explode()), 'Y', 'N')
print(df1)
Output
fields overlap
0 boy Y
1 apple N
2 toy N
3 orange Y
4 bear N
5 eat N
6 orange Y
7 girl Y
8 red N
CodePudding user response:
You can try .isin()
:
df1 = df1.explode("fields")
df1["overlap"] = df1["fields"].isin(df2["other fields"][0])
You can later replace the True/False with Y/N
CodePudding user response:
Another way is using np.select
. I normally use it for huge data sets where some methods may take a while to be executed:
df1 = df1.explode(column='fields')
df1['overlap'] = np.select([df1.fields.isin(df2['other fields'])], ['Y'], 'N')
index fields overlap
0 0 boy Y
1 0 apple N
2 0 toy N
3 0 orange Y
4 0 bear N
5 0 eat N
6 1 orange Y
7 1 girl Y
8 1 red N