I have a column of URLs and need to identify which ones contain 'ru' (meaning a Russian website).

I have a dataset that looks like this:

I need to assign 1 to a URL if it contains 'ru' (I am looking for Russian links), or 0 if it is not Russian.

I used this:

df['URL'].str.contains(r'-ru|/ru|.ru|')  #for 1
df['URL'].str.contains(r'(?!-ru) |(?!/ru) |(?!.ru)') #for 0

However, this doesn't work: it still selects URLs like 'example-rubin.com'.

CodePudding user response：

You could match one of the separators /, . or - followed by ru and then a word boundary (\b), so that ru cannot be the start of a longer word such as rubin:

[/.-]ru\b
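
To see what this pattern accepts and rejects, here is a minimal sketch using only the standard re module. The sample URLs are made up for illustration and are not from the question's dataset.

```python
import re

# A separator (/, . or -), then "ru", then a word boundary so that
# "ru" is not the beginning of a longer word.
RU_PATTERN = re.compile(r'[/.-]ru\b')

# Hypothetical sample URLs, invented for this illustration.
samples = [
    'example.ru',
    'example.com/ru/page',
    'news-ru.example.com',
    'example-rubin.com',
    'example.run',
]

# Map each URL to 1 if the pattern is found anywhere in it, else 0.
flags = {url: int(bool(RU_PATTERN.search(url))) for url in samples}

for url, flag in flags.items():
    print(url, flag)
```

The first three URLs are flagged with 1. 'example-rubin.com' and 'example.run' get 0 because the character after ru is a letter, so there is no word boundary at that position.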

For example:

import re
df['ru'] = df.apply(lambda row: 1 if re.search(r'[/.-]ru\b', row['URL']) else 0, axis=1)
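
As a possible alternative to a row-wise apply, the same pattern can probably be passed to pandas' own str.contains, which the question already uses. This is only a sketch: the DataFrame below is hypothetical sample data, and na=False is an assumption that missing URLs should count as non-Russian.

```python
import pandas as pd

# Hypothetical sample data; the real dataset is not shown in the question.
df = pd.DataFrame({
    'URL': [
        'example.ru',
        'example.com/ru/',
        'example-rubin.com',
        'example.com',
    ]
})

# str.contains returns a boolean Series; na=False treats missing URLs
# as non-matches, and astype(int) converts True/False to 1/0.
df['ru'] = df['URL'].str.contains(r'[/.-]ru\b', regex=True, na=False).astype(int)

print(df)
```

This keeps everything inside pandas and avoids importing re, at the cost of having to decide explicitly how missing values are handled.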