I am building a tweet classification model, and I am having trouble finding a regex pattern that fits what I am looking for. What I want the regex pattern to pick up:
- Any hashtags used in the tweet but without the hash mark (example - #omg to just omg)
- Any mentions used in the tweet but without the @ symbol (example - @username to just username)
- I don't want any numbers or any words containing numbers returned (this is the most difficult task for me)
- Other than that, I just want all words returned
Thank you in advance if you can help.
Currently I am using this pattern: r"(?u)\b\w\w+\b", but it fails to remove numbers.
CodePudding user response:
import re
tweet = "#omg @username I can't believe it's not butter! #butter #123 786 #one1"
# Define the regular expression: the word after '#' or '@', but only if it contains no digits
regex = r"(?u)(?<=[#@])[^\W\d]+\b"
# Extract the hashtags and mentions
hashtags_and_mentions = re.findall(regex, tweet)
# Print the results
print(hashtags_and_mentions) # Output: ['omg', 'username', 'butter']
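The snippet above returns only the hashtags and mentions. If the remaining plain words are wanted as well, as the question asks, one possible sketch is to match whole words made only of non-digit word characters. The helper name `extract_words` and the choice to keep apostrophes inside words are assumptions made here for illustration, not part of the answer.

```python
import re

# Hypothetical helper: every word that contains no digits.
# [^\W\d] means "a word character that is not a digit" (letters and underscore).
# The \b on both sides forces the whole token to be digit-free, and a leading
# '#' or '@' is dropped automatically because neither is a word character.
WORD_RE = re.compile(r"\b[^\W\d]+(?:'[^\W\d]+)*\b")

def extract_words(tweet: str) -> list:
    return WORD_RE.findall(tweet)

tweet = "#omg @username I can't believe it's not butter! #butter #123 786 #one1"
print(extract_words(tweet))
# ['omg', 'username', 'I', "can't", 'believe', "it's", 'not', 'butter', 'butter']
```

Tokens such as #123, 786 and #one1 are skipped entirely because no digit-free run in them is bounded by word boundaries on both sides.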
CodePudding user response:
This regex should work:
(#|@)?(?![^ ]*\d[^ ]*)([^ ]+)
Explanation:
(#|@)? : A 'hash' or 'at' character, matched 0 or 1 times.
(?!) : Negative lookahead. Looks ahead to see whether the pattern inside the brackets matches; if it does, the match fails at this position.
[^ ]*\d[^ ]* : Any number of non-space characters, followed by a digit, followed by any number of non-space characters. This is nested in the negative lookahead, so if a digit is found in the username or hashtag, the match is rejected.
([^ ]+) : One or more non-space characters. The negative lookahead is a zero-length match, so if it passes, this group captures the rest of the username/hashtag (grouped with brackets so you can replace with $2).
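To see how this pattern behaves in Python, here is a minimal sketch that collects group 2 from each match. The `tokens` helper and the second, anchored pattern are assumptions added for illustration. The anchored variant only lets a match start at the beginning of a space-delimited token, because the unanchored pattern can pick up the tail of a token once the scan has moved past its last digit. Note also that Python's `re.sub` refers to the second group as `\2` or `\g<2>` rather than `$2`.

```python
import re

# The pattern from the answer above, unchanged.
ANSWER_RE = re.compile(r"(#|@)?(?![^ ]*\d[^ ]*)([^ ]+)")

# Assumed variant (not from the answer): (?<![^ ]) requires the match to
# start at the beginning of the string or right after a space.
ANCHORED_RE = re.compile(r"(?<![^ ])(#|@)?(?![^ ]*\d)([^ ]+)")

def tokens(pattern, text):
    # Group 2 holds the token without its leading '#' or '@'.
    return [m.group(2) for m in pattern.finditer(text)]

tweet = "#omg @username I can't believe it's not butter! #butter #123 786 #one1"
print(tokens(ANSWER_RE, tweet))
# ['omg', 'username', 'I', "can't", 'believe', "it's", 'not', 'butter!', 'butter']

# A digit in the middle of a token: the unanchored pattern leaks the tail.
print(tokens(ANSWER_RE, "a1b ok"))    # ['b', 'ok']
print(tokens(ANCHORED_RE, "a1b ok"))  # ['ok']
```

Because both patterns split on spaces only, punctuation stays attached to the token (for example butter!), so a separate cleanup step may still be needed.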