Changing dataframe values after regex function problem


I am trying to build a pipeline for Twitter sentiment analysis. As usual, data preprocessing is the tricky part...

Based on real tweets, I made a dataframe with only 3 rows/tweets for experimentation.

What I am trying to do: 1: clear all @, ', http etc. from the tweet. 2: once that is done, replace the old tweet with the cleaned tweet.

This works only partially: just a fragment of some tweets comes back in my dataframe. The code does clean up the tweets, but it only writes part of the original tweet back. I think the problem is somewhere in the conversion of the tweet from string to list, but after many hours of trying I am unable to fix it.

The dataframe contents look like this (only an index and one column, Tweet); the tweets are of type string:

Index   Tweet
0       @justanamehere and a sentence here and a link http://www.test.com
1       @Personsname are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her .. @company1 @company2 #RETWEET https://x.something"
2      @companyx @companyex1 @company3 etc. AS lot of bad words here. It is a cancelculture, these rats want to badword https://x.Something

My code:

import re
import string

def strip_links(text):
    # Replace every http/https link in the text with ', '
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?+-=\\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    # Replace punctuation with spaces, then drop words that start with @ or #
    entity_prefixes = ['@', '#']
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    row['Tweet'] = ' '.join(words)

    return ' '.join(words)


# The code below is needed because the text in the df is of type str. Convert it to a list.

for index, row in df_tweet.iterrows():
    tweet = list(row['Tweet'].split(","))

    for t in tweet:
        strip_all_entities(strip_links(t))
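
To show what I mean with the string-to-list conversion, this is what the comma split does to the second tweet (illustration of the behaviour only, not a fix; the variable name sample is just for this demo):

sample = "@Personsname are a fraud and farce, a lying person together with the fake media."
print(sample.split(","))
# ['@Personsname are a fraud and farce', ' a lying person together with the fake media.']
# strip_all_entities() then overwrites row['Tweet'] once per fragment, so at most the
# last cleaned fragment of each tweet ends up in row['Tweet'].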

This produces this:

'and a sentence here and a link' 'are a fraud and farce' '' a lying person together with the fake media Something else Personname suppose you work with her' 'etc AS lot of bad words here It is a cancelculture' 'these rats want to badword'

But df_tweet contains only this:

    Tweet
0   and a sentence here and a link
1   a lying person together with the fake media So...
2   these rats want to badword

The expected result is:

index   Tweet
0       and a sentence here and a link
1       are a fraud and farce a lying person together with the fake media 
        Something else Personname? suppose you work with her
2       AS lot of bad words here It is a cancelculture these rats want to 
        badword

Thanks for helping me out!! Cheers Jan

CodePudding user response:

try:

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?+-=\\\.&\']+', '', regex=True)\
    .str.strip()

Output:

        Tweet
Index   
0       and a sentence here and a link
1       are a fraud and farce, a lying person together with the fake media. Something else Personname? suppose you work with her
2       etc. AS lot of bad words here. It is a cancelculture, these rats want to badword
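
For completeness, a self-contained sketch of the same chain, assuming pandas and a frame rebuilt from two of the question's sample tweets (shortened); the parenthesised chaining is only there so the comments fit:

import pandas as pd

# Rebuild a small frame with two of the question's sample tweets (shortened)
df = pd.DataFrame({'Tweet': [
    "@justanamehere and a sentence here and a link http://www.test.com",
    "@Personsname are a fraud and farce, a lying person together with the fake media. #RETWEET https://x.something",
]})

df['Tweet'] = (
    df['Tweet']
    .str.replace(r'[@#]\w*\b', '', regex=True)                    # drop @mentions and #hashtags
    .str.replace(r'https?://\S+', '', regex=True)                 # drop links
    .str.replace(r'\s[#@%/;$()~_?+-=\\\.&\']+', '', regex=True)   # drop punctuation runs after whitespace
    .str.strip()
)

print(df['Tweet'].tolist())
# ['and a sentence here and a link',
#  'are a fraud and farce, a lying person together with the fake media.']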

To delete only the non-western characters from the tweets but keep the tweets themselves:

df.Tweet = df.Tweet\
    .apply(lambda x: ''.join([i if i.isascii() else '' for i in x]))\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?+-=\\\.&\']+', '', regex=True)\
    .str.strip()
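
A quick check of the character filter on its own (a hypothetical example string, not from the question's data):

# Keep only ASCII characters, character by character
x = "bad word 好 https://x.something"
print(''.join([i if i.isascii() else '' for i in x]))
# bad word  https://x.something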

To delete tweets containing non-western characters:

df.Tweet = df.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?+-=\\\.&\']+', '', regex=True)\
    .str.strip()
df = df[df.Tweet.apply(lambda x: x.isascii())]
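
And a tiny illustration of the row filter by itself, on a hypothetical two-row frame:

import pandas as pd

# The second tweet contains non-ASCII characters and gets dropped
df = pd.DataFrame({'Tweet': ['plain ascii tweet', 'tweet with 中文 characters']})
df = df[df.Tweet.apply(lambda x: x.isascii())]
print(df)  # only the first (pure-ASCII) row is kept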

CodePudding user response:

Found a solution for removing Chinese (or similar) characters:

df_tweet.Tweet = df_tweet.Tweet\
    .str.replace(r'[@#]\w*\b', '', regex=True)\
    .str.replace(r'https?://\S+', '', regex=True)\
    .str.replace(r'\s[#@%/;$()~_?+-=\\\.&\']+', '', regex=True)\
    .str.replace(r'[^\x00-\x7f]', "", regex=True )\
    .str.strip()
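
The same [^\x00-\x7f] idea can be checked on a plain string with re (a minimal sketch, example text made up):

import re

# The class matches any character outside the ASCII range 0x00-0x7f, so re.sub removes it
text = "badword 中文 #RETWEET"
print(re.sub(r'[^\x00-\x7f]', '', text))
# badword  #RETWEET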