Python function: if a link in one dataset is part of a link in another dataset, assign 1, else 0

Sample datasets I have:

df1:

ID	Page Link
1	http://example1/path1/ru/path2/path3
2	https://example2.com/path1
3	https://example3.subdomain

df2:

ID	Link
1	http://example1/path1/ru
2	https://example2.com/path1
3	https://example3.subdomain/path2

In df1 I need to create a column ['Contains'] with values 1 or 0. If a df1 link contains one of the links in df2 (that is, a df2 link is part of it), then ['Contains']=1, else 0,

so that the end result looks like this:

df1

ID	Page Link	Contains
1	http://example1/path1/ru/path2/path3	1
2	https://example2.com/path1	1
3	https://example3.subdomain	0

I tried this:

def assign(column):
    for link in df2['Link']:
        if re.search(link, column):
            contains=1
        else:
            contains=0
    return contains

df1['Contains']=df1['Page Link'].apply(assign)

This didn't return the result I expected

leads['Marketing Team']=leads['Page Link'].apply(assign_marketing)

CodePudding user response：

Use:

import re

regex = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2.Link)
df1['Contains'] = df1['Page Link'].str.contains(regex).astype(int)

print(df1)
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0

If you need to test each page link against all links in df2 with a plain substring check:

df1['Contains'] = [int(any(x in link for x in df2['Link'])) for link in df1['Page Link']]
print(df1)
   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
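
The two variants above give the same result on the sample data, but they are not equivalent in general. Here is a minimal sketch of the difference, assuming a hypothetical fourth row (http://example1/path1/russia) that is not in the question, and hypothetical column names Contains_regex and Contains_substr:

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Page Link": [
        "http://example1/path1/ru/path2/path3",
        "https://example2.com/path1",
        "https://example3.subdomain",
        "http://example1/path1/russia",  # hypothetical row, not in the question
    ],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Link": [
        "http://example1/path1/ru",
        "https://example2.com/path1",
        "https://example3.subdomain/path2",
    ],
})

# Regex variant: \b requires the match to end at a word boundary,
# so ".../ru" does not match inside ".../russia".
regex = "|".join(r"\b{}\b".format(re.escape(x)) for x in df2["Link"])
df1["Contains_regex"] = df1["Page Link"].str.contains(regex).astype(int)

# Substring variant: any df2 link that occurs anywhere in the page link counts.
df1["Contains_substr"] = [
    int(any(x in link for x in df2["Link"])) for link in df1["Page Link"]
]

print(df1[["Page Link", "Contains_regex", "Contains_substr"]])
```

On the extra row the regex variant gives 0 and the substring variant gives 1, so which one you want depends on whether a partial path segment should count as a match.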

CodePudding user response：

The question is not entirely explicit, so in case you want to check the link match per ID, you can use:

s = df1['ID'].map(df2.set_index('ID')['Link'])

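# An ID missing from df2 maps to NaN, which is truthy, so test for a string explicitly
df1['Contains'] = [int(b in a) if isinstance(b, str) else 0 for a,b in zip(df1['Page Link'], s)]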

output:

   ID                             Page Link  Contains
0   1  http://example1/path1/ru/path2/path3         1
1   2            https://example2.com/path1         1
2   3            https://example3.subdomain         0
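
For completeness, here is a self-contained sketch of the per-ID check. It assumes a hypothetical fourth row in df1 whose ID does not exist in df2 (that row is not in the question). Such an ID maps to NaN, so the sketch checks that the mapped value is a string before testing containment:

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Page Link": [
        "http://example1/path1/ru/path2/path3",
        "https://example2.com/path1",
        "https://example3.subdomain",
        "https://example4.org/page",  # hypothetical row with no partner in df2
    ],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Link": [
        "http://example1/path1/ru",
        "https://example2.com/path1",
        "https://example3.subdomain/path2",
    ],
})

# Look up the df2 link that shares each row's ID (NaN where the ID is missing).
s = df1["ID"].map(df2.set_index("ID")["Link"])

# 1 only if a df2 link exists for this ID and occurs inside the page link.
df1["Contains"] = [
    int(isinstance(b, str) and b in a) for a, b in zip(df1["Page Link"], s)
]

print(df1)
```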

CodePudding user response：

The function that you use does not stop when it finds a match, and may later "change its mind." Here is a fixed version:

import re

def assign(column):
    for link in df2['Link']:
        # re.escape makes '.', '?' and similar characters in the link match literally
        if re.search(re.escape(link), column):
            return 1 # Return at once!
    return 0
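
To see the "change its mind" effect in isolation, here is a small sketch without pandas. The list links stands in for df2['Link'], the function names are made up for illustration, and re.escape is my addition so that the dots in the URLs are matched literally:

```python
import re

# Stand-in for df2['Link'] (same values as in the question).
links = [
    "http://example1/path1/ru",
    "https://example2.com/path1",
    "https://example3.subdomain/path2",
]

def assign_last_wins(page_link):
    """Same logic as the question: the result of the LAST link overwrites earlier ones."""
    contains = 0
    for link in links:
        if re.search(re.escape(link), page_link):
            contains = 1
        else:
            contains = 0  # discards a match found on an earlier iteration
    return contains

def assign_first_match(page_link):
    """Returns as soon as any link matches."""
    for link in links:
        if re.search(re.escape(link), page_link):
            return 1
    return 0

page = "http://example1/path1/ru/path2/path3"
print(assign_last_wins(page), assign_first_match(page))  # 0 1
```

The first link matches, but the loop in assign_last_wins keeps going and the two later non-matches reset the result to 0.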

If the match is always at the beginning, you can replace the expensive re.search() with a much faster startswith():

def assign(column):
    for link in df2['Link']:
        if column.startswith(link):
            return 1 # Return at once!
    return 0
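
As a possible further simplification, Python's str.startswith() also accepts a tuple of prefixes, so the explicit loop can be dropped. This is only a sketch, assuming the intended check is "the page link begins with one of the df2 links"; the names prefixes and pages are stand-ins for df2['Link'] and df1['Page Link']:

```python
# Stand-in for df2['Link']; str.startswith needs a tuple, not a list.
prefixes = (
    "http://example1/path1/ru",
    "https://example2.com/path1",
    "https://example3.subdomain/path2",
)

# Stand-in for df1['Page Link'].
pages = [
    "http://example1/path1/ru/path2/path3",
    "https://example2.com/path1",
    "https://example3.subdomain",
]

def assign(page_link):
    # True if page_link begins with any of the prefixes.
    return int(page_link.startswith(prefixes))

print([assign(p) for p in pages])  # [1, 1, 0]
```

With the real frames this would be something like df1['Page Link'].apply(assign) after building prefixes with tuple(df2['Link']).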