I am trying to build a pipeline for Twitter sentiment analysis. As usual, data preprocessing is a thing...
Based on real tweets, I made a dataframe with only 3 rows (tweets) to experiment with.
What I am trying to do: 1: remove all @, ', http etc. from the tweet. 2: once that is done, replace the old tweet with the cleaned tweet.
This works only partially: just a part of some tweets comes back in my dataframe. The code does clean up the tweets, but it places only a part of the original tweet back. I think the problem is somewhere in the conversion of the tweet from string to list, but after many hours of trying I am unable to fix it.
The dataframe contents look like this (only an index and one column, Tweet); the tweets are of type string:
Index Tweet
0 @justanamehere and a sentence here and a link http://www.test.com
1 @Personsname are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her .. @company1 @company2 #RETWEET https://x.something"
2 @companyx @companyex1 @company3 etc. AS lot of bad words here. It is a cancelculture, these rats want to badword https://x.Something
My code:
import re
import string

def strip_links(text):
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    entity_prefixes = ['@','#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator,' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    # Writes the cleaned piece into the current row (the global `row` from the loop below).
    row['Tweet'] = ' '.join(words)
    return ' '.join(words)

# The code below is needed because the text in the df has type str. Convert it to a list.
for index, row in df_tweet.iterrows():
    tweet = list(row['Tweet'].split(","))
    for t in tweet:
        strip_all_entities(strip_links(t))
This produces this:
'and a sentence here and a link' 'are a fraud and farce' '' a lying person together with the fake media Something else Personname suppose you work with her' 'etc AS lot of bad words here It is a cancelculture' 'these rats want to badword'
But in df_tweet it shows only this:
Tweet
0 and a sentence here and a link
1 a lying person together with the fake media So...
2 these rats want to badword
The expected result is:
index Tweet
0 and a sentence here and a link
1 are a fraud and farce a lying person together with the fake media
Something else Personname? suppose you work with her
2 AS lot of bad words here It is a cancelculture these rats want to
badword
Thanks for helping me out!! Cheers Jan
CodePudding user response:
Try this:
df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+-=\\\.&\']+', '', regex=True)\
    .str.strip()
Output:
Tweet
Index
0 and a sentence here and a link
1 are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her
2 etc. AS lot of bad words here. It is a cancelculture, these rats want to badword
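As for why the loop in the question put back only part of each tweet: it splits every tweet on "," and cleans each piece separately, and strip_all_entities writes each cleaned piece into row['Tweet'], so every piece appears to overwrite the one before it and only the last one is left. A minimal plain-Python sketch of that effect and of a fix that cleans the whole string at once. The simplified URL pattern, the `clean_tweet` name and the shortened sample tweet are my own assumptions, not taken from the question:

```python
import re
import string

ENTITY_PREFIXES = ('@', '#')

def strip_links(text):
    # Simplified URL pattern (an assumption; the question uses a longer one).
    return re.sub(r'https?://\S+', ' ', text)

def strip_all_entities(text):
    # Turn every punctuation character except '@' and '#' into a space.
    for separator in string.punctuation:
        if separator not in ENTITY_PREFIXES:
            text = text.replace(separator, ' ')
    # Drop the words that start with '@' or '#' and rejoin the rest.
    return ' '.join(w for w in text.split() if w[0] not in ENTITY_PREFIXES)

def clean_tweet(text):
    # Hypothetical helper: clean the whole tweet, without splitting on ','.
    return strip_all_entities(strip_links(text))

# Shortened stand-in for the second tweet in the question.
tweet = ('@Personsname are a fraud and farce, a lying person. '
         '@company1 #RETWEET https://x.something')

# Mimics the loop in the question: each cleaned piece replaces the
# previous one, so only the last comma-separated piece survives.
kept = None
for piece in tweet.split(','):
    kept = clean_tweet(piece)
print(kept)

# Cleaning the whole string keeps both halves of the tweet.
whole = clean_tweet(tweet)
print(whole)
```

On the dataframe itself this could be applied with df_tweet['Tweet'] = df_tweet['Tweet'].apply(clean_tweet), which needs neither iterrows nor an assignment inside the helper.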
To strip only the non-ASCII ("non-western") characters from the tweets but keep the tweets themselves:
df.Tweet = df.Tweet\
    .apply(lambda x: ''.join([i if i.isascii() else '' for i in x]))\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+-=\\\.&\']+', '', regex=True)\
    .str.strip()
To drop the tweets that contain non-ASCII ("non-western") characters altogether:
df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+-=\\\.&\']+', '', regex=True)\
    .str.strip()
df = df[df.Tweet.apply(lambda x: x.isascii())]
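To make the difference between the two variants concrete, here is a small sketch on made-up data. The sample tweets are illustrative assumptions, and str.isascii needs Python 3.7 or newer:

```python
import pandas as pd

# Made-up sample data, not taken from the question.
df = pd.DataFrame({'Tweet': ['plain ascii tweet', 'mixed \u4f60\u597d tweet']})

# Variant 1: keep every row, strip only the non-ASCII characters.
stripped = df.Tweet.apply(lambda x: ''.join(c for c in x if c.isascii()))
print(stripped.tolist())

# Variant 2: drop every row that contains a non-ASCII character.
kept_rows = df[df.Tweet.apply(lambda x: x.isascii())]
print(kept_rows.Tweet.tolist())
```

Variant 1 leaves a double space where the characters were removed; a further .str.replace(r'\s+', ' ', regex=True) would collapse it if that matters for the later analysis.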
CodePudding user response:
I found a solution for removing Chinese (or similar non-ASCII) characters:
df_tweet.Tweet = df_tweet.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?\+-=\\\.&\']+', '', regex=True)\
    .str.replace(r'[^\x00-\x7f]', "", regex=True)\
    .str.strip()