I have a tweet dataset (from this link) with a content column.
I want to add a new column that indicates whether or not each tweet mentions Trump. The regex pattern (^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$) seems to work, but I don't fully understand it. I've tested it with the code below:
Test 1 returns a match object, since the pattern matches:
import re
txt1 = "anti-Trump protesters"
re.search("(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)", txt1)
Out:
<_sre.SRE_Match object; span=(4, 11), match='-Trump '>
Test 2 returns None, as expected, since the pattern does not match:
txt2 = 'I got Trumped'
re.search("(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)", txt2)
Could someone explain this pattern a little? Many thanks in advance.
CodePudding user response:
The (^|[^A-Za-z0-9]) portion contains |, which means “or”. The left side, ^, is the start of the string. The right side, [^A-Za-z0-9], matches any single character that is not a letter or a digit. In short, it matches when “Trump” is at the start of the string or is preceded by a non-alphanumeric character.
The ([^A-Za-z0-9]|$) portion follows the same idea: the left side matches any single character that is not a letter or a digit, and the right side, $, matches the end of the string. Likewise, it matches when “Trump” is at the end of the string or is followed by a non-alphanumeric character.
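To see which alternative fires on each side, here is a small sketch that prints the two captured groups for a few strings. The sample strings other than the two from the question, and the helper name describe, are made up for illustration. An empty string in a group means the ^ or $ side matched.

```python
import re

# Same pattern as in the question, compiled once.
PATTERN = re.compile(r"(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)")

samples = [
    "Trump spoke today",      # start of string, followed by a space
    "anti-Trump protesters",  # preceded by "-", followed by a space
    "I voted for Trump",      # preceded by a space, end of string
    "I got Trumped",          # followed by the letter "e": no match
    "TheTrumpShow",           # preceded by a letter: no match
]

def describe(text):
    """Return (group 1, group 2) of the first match, or None if no match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return (m.group(1), m.group(2))

results = {text: describe(text) for text in samples}
for text, groups in results.items():
    print(repr(text), "->", groups)
```

The groups come back as ('', ' ') for the first sample, ('-', ' ') for the second, and (' ', '') for the third; the last two return None.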
So, bottom line: it matches “Trump” when it is either at the start of the string or preceded by a non-alphanumeric character, and is also either at the end of the string or followed by a non-alphanumeric character. The neighbouring characters are part of the match, which is why Test 1 reports '-Trump ' rather than 'Trump'.
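For the original goal of flagging tweets, a minimal sketch could look like the following. It uses plain Python lists and dicts rather than the actual dataset, so the tweets list, the mentions_trump helper, and the mentions_trump key are all hypothetical. The groups are written as non-capturing (?:...) here, which is my own variation; it does not change what matches.

```python
import re

# Same logic as the question's pattern; (?:...) groups only avoid capturing.
TRUMP_RE = re.compile(r"(?:^|[^A-Za-z0-9])Trump(?:[^A-Za-z0-9]|$)")

def mentions_trump(text):
    """True if 'Trump' appears as a standalone token in text."""
    return TRUMP_RE.search(text) is not None

# Hypothetical stand-in for the dataset's content column.
tweets = [
    "anti-Trump protesters",
    "I got Trumped",
    "Trump",
    "#Trump2020",
]

# Build rows with the new boolean flag next to the original text.
rows = [{"content": t, "mentions_trump": mentions_trump(t)} for t in tweets]
flags = [row["mentions_trump"] for row in rows]
print(flags)
```

If the data lives in a pandas DataFrame, the same function could presumably be applied to the content column, for example with Series.apply.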