Sample datasets I have:
df1:
ID | Page Link |
---|---|
1 | http://example1/path1/ru/path2/path3 |
2 | https://example2.com/path1 |
3 | https://example3.subdomain |
df2:
ID | Link |
---|---|
1 | http://example1/path1/ru |
2 | https://example2.com/path1 |
3 | https://example3.subdomain/path2 |
In df1 I need to create a column ['Contains'] with values 1 or 0. If a df1 link contains any of the links in df2, then ['Contains']=1, else 0,
so that the end result looks like this:
df1
ID | Page Link | Contains |
---|---|---|
1 | http://example1/path1/ru/path2/path3 | 1 |
2 | https://example2.com/path1 | 1 |
3 | https://example3.subdomain | 0 |
I tried this:
def assign(column):
    for link in df2['Link']:
        if re.search(link, column):
            contains=1
        else:
            contains=0
    return contains
df1['Contains']=df1['Page Link'].apply(assign)
This didn't return the result I expected.
leads['Marketing Team']=leads['Page Link'].apply(assign_marketing)
CodePudding user response:
Use:
import re
regex = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2.Link)
df1['Contains'] = df1['Page Link'].str.contains(regex).astype(int)
print (df1)
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
If you need to test against all links as plain substrings:
df1['Contains'] = [int(any(x in link for x in df2['Link'])) for link in df1['Page Link']]
print (df1)
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
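The two variants above agree on the sample data but can differ on edge cases. Here is a minimal sketch of the difference, using plain Python lists in place of the DataFrame columns and a hypothetical extra URL ending in `path10` that is not part of the question's data:

```python
import re

# Links from df2 in the question
df2_links = [
    "http://example1/path1/ru",
    "https://example2.com/path1",
    "https://example3.subdomain/path2",
]

# One alternation pattern; \b requires a word boundary on both sides of each link
regex = "|".join(r"\b{}\b".format(re.escape(x)) for x in df2_links)

def contains_regex(page_link):
    """1 if any df2 link occurs in page_link as a whole 'word', else 0."""
    return int(re.search(regex, page_link) is not None)

def contains_substring(page_link):
    """1 if any df2 link occurs anywhere in page_link, else 0."""
    return int(any(x in page_link for x in df2_links))

# Hypothetical URL (an assumption, not from the question) where the checks disagree
tricky = "https://example2.com/path10"
print(contains_regex(tricky), contains_substring(tricky))  # prints: 0 1
```

The `\b` version rejects `path10` because there is no word boundary between `1` and `0`, while the plain substring test accepts it. Which behaviour is wanted depends on the data.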
CodePudding user response:
The question is not fully explicit, so in case you want to check the link match per ID, you can use:
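# Align each df1 row with the df2 link that shares its ID (NaN if the ID is missing in df2)
s = df1['ID'].map(df2.set_index('ID')['Link'])
# NaN is truthy, so check for a real string before the substring test
df1['Contains'] = [int(isinstance(b, str) and b in a) for a, b in zip(df1['Page Link'], s)]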
output:
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
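The same per-ID idea can be sketched without pandas, which may make the logic easier to follow. The names `df1_rows`, `df2_by_id` and `contains_per_id` are made up for this illustration:

```python
# Rows from the question as (ID, link) pairs, and df2 as an ID -> link mapping
df1_rows = [
    (1, "http://example1/path1/ru/path2/path3"),
    (2, "https://example2.com/path1"),
    (3, "https://example3.subdomain"),
]
df2_by_id = {
    1: "http://example1/path1/ru",
    2: "https://example2.com/path1",
    3: "https://example3.subdomain/path2",
}

def contains_per_id(rows, links_by_id):
    """For each row, 1 if the df2 link with the same ID occurs in the page link."""
    result = []
    for row_id, page_link in rows:
        link = links_by_id.get(row_id)  # None when the ID has no counterpart
        result.append(int(link is not None and link in page_link))
    return result

flags = contains_per_id(df1_rows, df2_by_id)
print(flags)  # prints: [1, 1, 0]
```

Unlike the "any link" approaches, a row here is only compared with the single df2 link that has the same ID.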
CodePudding user response:
The function that you use does not stop when it finds a match, and may later "change its mind." Here is a fixed version:
import re

def assign(column):
    for link in df2['Link']:
        # re.escape() keeps '.' and other special characters literal
        if re.search(re.escape(link), column):
            return 1 # Return at once!
    return 0
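To see the "change of mind" concretely, here is a small sketch that runs both loop styles on the first page link from the question. It assumes the original function returns after the loop has finished, uses a plain list instead of `df2['Link']`, and adds `re.escape()` so the links are matched literally; the function names are made up for the comparison:

```python
import re

# Links from df2 in the question
df2_links = [
    "http://example1/path1/ru",
    "https://example2.com/path1",
    "https://example3.subdomain/path2",
]

def assign_overwriting(column):
    """Mimics the question's loop: the flag is rewritten on every iteration."""
    contains = 0
    for link in df2_links:
        if re.search(re.escape(link), column):
            contains = 1
        else:
            contains = 0  # a later non-match erases an earlier match
    return contains

def assign_early_return(column):
    """Stops at the first match, so later links cannot undo it."""
    for link in df2_links:
        if re.search(re.escape(link), column):
            return 1
    return 0

page = "http://example1/path1/ru/path2/path3"
print(assign_overwriting(page), assign_early_return(page))  # prints: 0 1
```

The first link matches, but the overwriting version then compares against the second and third links, which do not match, and ends up reporting 0.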
If the match is always at the beginning, you can replace the expensive re.search() with a much faster startswith():
def assign(column):
    for link in df2['Link']:
        # The page link must start with the df2 link, not the other way round
        if column.startswith(link):
            return 1 # Return at once!
    return 0
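Python's built-in `str.startswith()` also accepts a tuple of prefixes, so the explicit loop can be dropped when the test is "does the page link start with any df2 link". A minimal sketch with plain Python data standing in for the two columns (`prefixes` and `page_links` are names chosen for this example):

```python
# df2 links as a tuple, since str.startswith() accepts a tuple of prefixes
prefixes = (
    "http://example1/path1/ru",
    "https://example2.com/path1",
    "https://example3.subdomain/path2",
)

# Page links from df1 in the question
page_links = [
    "http://example1/path1/ru/path2/path3",
    "https://example2.com/path1",
    "https://example3.subdomain",
]

def assign(column):
    """1 if the page link starts with any of the df2 links, else 0."""
    return int(column.startswith(prefixes))

flags = [assign(link) for link in page_links]
print(flags)  # prints: [1, 1, 0]
```

With the DataFrames from the question, the tuple could be built as `tuple(df2['Link'])` and the function passed to `df1['Page Link'].apply(assign)`.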