I have a dataset that looks like this:
ID | URL |
---|---|
1 | example.ae/ru/page2 |
2 | example.rubin.com |
3 | NaN |
4 | example-ru/example |
I need to flag each URL with 1 if it contains 'ru' (I am looking for Russian links) and with 0 if it does not. The expected result:
ID | URL | 'ru' |
---|---|---|
1 | example.ae/ru/page2 | 1 |
2 | example.rubin.com | 0 |
3 | NaN | 0 |
4 | example-ru/example | 1 |
I used this:
df['URL'].str.contains(r'-ru|/ru|.ru|') #for 1
df['URL'].str.contains(r'(?!-ru) |(?!/ru) |(?!.ru)') #for 0
However, this doesn't work: it still selects URLs like 'example-rubin.com'.
CodePudding user response:
You could match either /, . or - followed by ru and a word boundary:
[/.-]ru\b
For example:
import re
df['ru'] = df['URL'].apply(lambda url: 1 if isinstance(url, str) and re.search(r'[/.-]ru\b', url) else 0)  # isinstance guard: re.search raises TypeError on NaN
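
As a vectorized alternative to the row-wise apply, the same pattern can be passed to str.contains. The following is a minimal sketch, assuming the sample data from the question and that the NaN in row 3 is a real missing value rather than the string 'NaN'; the variable name pattern is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; np.nan stands in for the missing URL.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "URL": ["example.ae/ru/page2", "example.rubin.com", np.nan, "example-ru/example"],
})

# 'ru' must follow '/', '.' or '-' and end at a word boundary,
# so 'rubin' is rejected while '/ru/' and '-ru/' are accepted.
pattern = r"[/.-]ru\b"

# na=False counts missing URLs as non-matches instead of propagating NaN.
df["ru"] = df["URL"].str.contains(pattern, regex=True, na=False).astype(int)

print(df)
```

Two things in the original attempt explain the false positives: the trailing | adds an empty alternative that matches every string, and the unescaped . matches any character. Inside a character class such as [/.-] the dot is literal, so no escaping is needed there.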