I have a column of URLs and need to identify which contain 'ru' (meaning Russian website)

Time:03-29

I have a dataset that looks like this:

ID URL
1 example.ae/ru/page2
2 example.rubin.com
3 NaN
4 example-ru/example

I need to assign 1 to a URL if it contains 'ru' (I am looking for Russian links), or 0 if it does not.

ID URL 'ru'
1 example.ae/ru/page2 1
2 example.rubin.com 0
3 NaN 0
4 example-ru/example 1

I used this:

df['URL'].str.contains(r'-ru|/ru|.ru|')  #for 1
df['URL'].str.contains(r'(?!-ru) |(?!/ru) |(?!.ru)') #for 0

However, this doesn't work: it still selects URLs like 'example-rubin.com'.

CodePudding user response:

You could match either /, . or - followed by ru and a word boundary (\b), so that ru has to end there and names like rubin are not matched:

[/.-]ru\b
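
To see how the word boundary behaves, here is a minimal sketch that runs the pattern with Python's re module over the sample URLs from the question, plus the 'example-rubin.com' case mentioned there. The names RU_PATTERN, samples and flags are only illustrative.

```python
import re

# '/', '.' or '-' followed by 'ru', and then a word boundary
RU_PATTERN = re.compile(r'[/.-]ru\b')

samples = [
    'example.ae/ru/page2',   # '/ru' followed by '/'
    'example.rubin.com',     # '.ru' followed by 'b', so no boundary
    'example-ru/example',    # '-ru' followed by '/'
    'example-rubin.com',     # '-ru' followed by 'b', so no boundary
]

# Map each URL to 1 if the pattern is found anywhere in it, else 0
flags = {url: int(bool(RU_PATTERN.search(url))) for url in samples}

for url, flag in flags.items():
    print(url, flag)
```

Only the first and third URLs get 1, because in the other two ru is immediately followed by another word character.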

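For example (the isinstance check is needed because re.search raises a TypeError on NaN values):

import re
df['ru'] = df['URL'].apply(lambda url: 1 if isinstance(url, str) and re.search(r'[/.-]ru\b', url) else 0)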
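
As an alternative, the same pattern can probably be used with the vectorized str.contains that the question started from. This is a sketch under the assumption that the column looks like the sample in the question; the DataFrame is rebuilt here only to make the snippet self-contained. Passing na=False makes missing URLs count as non-matches.

```python
import pandas as pd

# Sample data rebuilt from the question; row 3 has a missing URL
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'URL': ['example.ae/ru/page2', 'example.rubin.com', float('nan'), 'example-ru/example'],
})

# na=False turns missing values into False instead of NaN,
# so the boolean result can be cast to 0/1 directly
df['ru'] = df['URL'].str.contains(r'[/.-]ru\b', regex=True, na=False).astype(int)

print(df)
```

This avoids a Python-level lambda per row and handles the NaN row without an explicit type check.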