How can I get my Python code to run again?

I wrote a Python script that uses for loops to extract metadata from tweets, and it worked well initially. After I replaced the for loops with list comprehensions, the code raises an error that I can't decipher. Here is my code:

def tweetFeatures(tweet):
        #Count the number of words in each tweet
        wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
        
        #Count the number of characters in each tweet
        chars = [len(tweet.loc[k]) for k in range(len(tweet))]
        
        #Extract the mentions in each tweet
        mentions = [list(re.findall("@([a-zA-Z0-9_]{1,50})",tweet.loc[p])) for p in range(len(tweet))]
        
        #Counts the number of mentions in each tweet 
        mention_count = [len(mentions[t]) for t in range(len(mentions))]
        
        #Extracts the hashtags in each tweet    
        hashtags = [list(re.findall("#([a-zA-Z0-9_]{1,50})",tweet.loc[f])) for f in range(len(tweet))]
        
        #Counts the number of hashtags in each tweet    
        hashtag_count = [len(hashtags[d]) for d in range(len(hashtags))]
        
        #Extracts the urls in each tweet
        url = [list(re.findall("(?P<url>https?://[^\s]+)",tweet.loc[l])) for l in range(len(tweet))]
        
        #Counts the number of urls in each tweet
        url_count = [len(url[c]) for c in range(len(url))]
        
        #Put everything into a dataframe
        feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
        feats_df = pd.DataFrame(feats)

        return feats_df

Here is the error I now get after running tweetFeatures(tweet = text_df):

AttributeError                            Traceback (most recent call last)
<ipython-input-22-a074a939c816> in <module>
----> 1 tweetFeatures(tweet = text_df)

<ipython-input-21-36def6dfde04> in tweetFeatures(tweet)
      1 def tweetFeatures(tweet):
      2         #Count the number of words in each tweet
----> 3         wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4 
      5         #Count the number of characters in each tweet

<ipython-input-21-36def6dfde04> in <listcomp>(.0)
      1 def tweetFeatures(tweet):
      2         #Count the number of words in each tweet
----> 3         wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4 
      5         #Count the number of characters in each tweet

~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466 
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'split'

This is the test data I created:

text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
           "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
           "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
           "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
           "Why can't the #Athletics be more exciting? #Tokyo2020",
           "It is so much fun to see beautful colors at the #Olympics"]

I convert it to a pandas DataFrame with text_df = pd.DataFrame(text) and then print it with print(text_df), which gives:

0
0   @Harrison2Jennifer Tokyo 2020 is so much fun. ...
1   Gabrielle Thomas is my favourite sprinter @Too...
2   @Sports_head I wish the #Tokyo2020 @Olympics w...
3   I hate the attitude of officials at this olymp...
4   Why can't the #Athletics be more exciting? #To...
5   It is so much fun to see beautful colors at th...

The code was written in a Jupyter notebook. I would appreciate any suggestions as to what exactly has gone wrong. Thank you.

CodePudding user response：

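According to your error message, AttributeError: 'Series' object has no attribute 'split', you are calling the string method split() on a pandas Series object. Because text_df is a DataFrame, tweet.loc[j] returns row j as a Series rather than the string stored in it. The failing line is: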

wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
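A minimal sketch of why this happens, assuming a one-column DataFrame built like the one in the question (the sample tweets are shortened and the names row and cell are only illustrative). On a DataFrame, .loc[j] selects a whole row and returns it as a Series, whereas selecting the column first gives back the stored string:

```python
import pandas as pd

# Two sample tweets, shortened from the question's test data
text = ["@Sports_head I wish the #Tokyo2020 @Olympics will never end",
        "Why can't the #Athletics be more exciting? #Tokyo2020"]

text_df = pd.DataFrame(text)   # one column, labelled 0
row = text_df.loc[0]           # a whole row -> pandas Series
cell = text_df[0].loc[0]       # column 0, then row 0 -> plain str

print(type(row).__name__)      # Series
print(type(cell).__name__)     # str
print(hasattr(row, "split"))   # False: a Series has no .split method
print(len(cell.split()))       # word count of the first tweet
```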

Looking at the test data you provided, you could fix the error by naming the column and selecting it before indexing, for example:

import pandas as pd

text_df = pd.DataFrame(text,columns=["tweet"])

text_df.tweet.loc[0].split()

This returns:

['@Harrison2Jennifer',
 'Tokyo',
 '2020',
 'is',
 'so',
 'much',
 'fun.',
 'Loving',
 'every',
 'bit',
 'of',
 'it',
 'just',
 'as',
 '@MeggyJane',
 '&',
 '@Tommy620',
 'say',
 '#Tokyo2020',
 'https://www.corp.com']

Alternatively, you can skip pandas for the input entirely: pass the "raw" list of tweets and change the list comprehension to

wordcount = [len(t.split()) for t in tweet]
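Extending that idea to the other features might look like the sketch below. The helper name tweet_features_from_list is hypothetical, and the URL pattern assumes the intent was "http(s):// followed by one or more non-space characters":

```python
import re

# Patterns taken from the question; the URL pattern is an assumption (see above)
MENTION = re.compile(r"@([a-zA-Z0-9_]{1,50})")
HASHTAG = re.compile(r"#([a-zA-Z0-9_]{1,50})")
URL = re.compile(r"https?://[^\s]+")

def tweet_features_from_list(tweets):
    """Return one dict of counts per tweet (hypothetical helper, not from the question)."""
    return [
        {
            "n_words": len(t.split()),
            "n_chars": len(t),
            "n_mentions": len(MENTION.findall(t)),
            "n_hashtag": len(HASHTAG.findall(t)),
            "n_url": len(URL.findall(t)),
        }
        for t in tweets
    ]

sample = ["Gabrielle Thomas is my favourite sprinter @TooSports "
          "https://www.flick.org https://www.bugger.com"]
features = tweet_features_from_list(sample)
print(features[0])
```

A list of dicts like this can be passed straight to pd.DataFrame if a table is still wanted at the end.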

CodePudding user response：

You are creating a pd.DataFrame that has only a single column, and in your case that column is called 0. Indexing a row of that DataFrame with .loc gives you a Series, not a string.

So you can fix your code by either:

- Passing the column instead of the whole DataFrame: tweetFeatures(tweet = text_df[0])
- Creating a Series instead of a DataFrame: text_df = pd.Series(text), and calling the function as you do right now.
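Both options can be checked with a small sketch. The word_counts function is a hypothetical cut-down version of tweetFeatures that keeps only the word count, and the two tweets are taken from the question's test data:

```python
import pandas as pd

text = ["Why can't the #Athletics be more exciting? #Tokyo2020",
        "It is so much fun to see beautful colors at the #Olympics"]

def word_counts(tweet):
    # Same indexing pattern as the question; expects a Series of strings
    return [len(tweet.loc[j].split()) for j in range(len(tweet))]

text_df = pd.DataFrame(text)

# Option 1: pass the single column (labelled 0) instead of the whole DataFrame
from_column = word_counts(text_df[0])

# Option 2: build a Series from the start
from_series = word_counts(pd.Series(text))

print(from_column == from_series)  # both options give the same counts
```

Note that tweet.loc[j] with j from range(len(tweet)) relies on the default 0..n-1 index; if the Series had a different index, tweet.iloc[j] would be the safer lookup.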

Additionally, in most cases you can speed up your function by using apply. Note that this is a bit slower for small inputs such as the sample you provided, but gives a significant speed-up with more tweets:

text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
       "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
       "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
       "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
       "Why can't the #Athletics be more exciting? #Tokyo2020",
       "It is so much fun to see beautful colors at the #Olympics"]*1000

import re
from functools import partial

import pandas as pd

def tweetFeatures_speedup(tweet):
    #Count the number of words in each tweet
    wordcount = tweet.apply(lambda x: len(x.split()))
    
    #Count the number of characters in each tweet
    chars = tweet.apply(len)
    
    #Extract the mentions in each tweet
    mention_finder = partial(re.findall, "@([a-zA-Z0-9_]{1,50})")
    
    #Counts the number of mentions in each tweet
    mention_count = tweet.apply(lambda x: len(mention_finder(x)))
    
    #Extracts the hashtags in each tweet    
    #Counts the number of hashtags in each tweet   
    hashtag_finder = partial(re.findall, "#([a-zA-Z0-9_]{1,50})")
    hashtag_count = tweet.apply(lambda x: len(hashtag_finder(x)))
    
    #Extracts the urls in each tweet
    #Counts the number of urls in each tweet
    url_finder = partial(re.findall, r"(?P<url>https?://[^\s]+)")
    url_count = tweet.apply(lambda x: len(url_finder(x)))
    
    #Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)

    return feats_df

This results in the %%timeit comparison:

Your version: 193 ms ± 1.95 ms per loop
My version: 21.3 ms ± 85.1 µs per loop
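Outside a notebook, a comparison like this could be reproduced with the standard timeit module. The sketch below times only the word count, uses hypothetical function names, and makes no claim about the numbers you will see, since they depend on the machine and the pandas version:

```python
import timeit

import pandas as pd

tweets = pd.Series(
    ["I hate the attitude of officials at this olympics @Kieran https://www.launch.com"] * 1000
)

def loop_word_count(tweet):
    # List comprehension with label-based lookup, as in the question
    return [len(tweet.loc[j].split()) for j in range(len(tweet))]

def apply_word_count(tweet):
    # Series.apply, as in the faster version above
    return tweet.apply(lambda x: len(x.split()))

# Both approaches should agree before their speed is compared
assert loop_word_count(tweets) == apply_word_count(tweets).tolist()

runs = 10
loop_time = timeit.timeit(lambda: loop_word_count(tweets), number=runs)
apply_time = timeit.timeit(lambda: apply_word_count(tweets), number=runs)
print(f"loop:  {loop_time / runs * 1000:.2f} ms per call")
print(f"apply: {apply_time / runs * 1000:.2f} ms per call")
```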