I wrote a Python script with for loops to extract metadata from tweets, and it worked well initially. I have now replaced the for loops with list comprehensions, and my code throws an error that I can't decipher. Here is my code:
def tweetFeatures(tweet):
    #Count the number of words in each tweet
    wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
    #Count the number of characters in each tweet
    chars = [len(tweet.loc[k]) for k in range(len(tweet))]
    #Extract the mentions in each tweet
    mentions = [list(re.findall("@([a-zA-Z0-9_]{1,50})",tweet.loc[p])) for p in range(len(tweet))]
    #Count the number of mentions in each tweet
    mention_count = [len(mentions[t]) for t in range(len(mentions))]
    #Extract the hashtags in each tweet
    hashtags = [list(re.findall("#([a-zA-Z0-9_]{1,50})",tweet.loc[f])) for f in range(len(tweet))]
    #Count the number of hashtags in each tweet
    hashtag_count = [len(hashtags[d]) for d in range(len(hashtags))]
    #Extract the urls in each tweet
    url = [list(re.findall(r"(?P<url>https?://[^\s]+)",tweet.loc[l])) for l in range(len(tweet))]
    #Count the number of urls in each tweet
    url_count = [len(url[c]) for c in range(len(url))]
    #Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)
    return feats_df
Here is the error I now get when I run tweetFeatures(tweet = text_df):
AttributeError                            Traceback (most recent call last)
<ipython-input-22-a074a939c816> in <module>
----> 1 tweetFeatures(tweet = text_df)

<ipython-input-21-36def6dfde04> in tweetFeatures(tweet)
      1 def tweetFeatures(tweet):
      2     #Count the number of words in each tweet
----> 3     wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4
      5     #Count the number of characters in each tweet

<ipython-input-21-36def6dfde04> in <listcomp>(.0)
      1 def tweetFeatures(tweet):
      2     #Count the number of words in each tweet
----> 3     wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4
      5     #Count the number of characters in each tweet

~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'split'
This is the test data I created:
text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
"Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
"@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
"I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
"Why can't the #Athletics be more exciting? #Tokyo2020",
"It is so much fun to see beautful colors at the #Olympics"]
I convert it to a pandas DataFrame with text_df = pd.DataFrame(text) and then print it with print(text_df), which gives the following result:
0
0 @Harrison2Jennifer Tokyo 2020 is so much fun. ...
1 Gabrielle Thomas is my favourite sprinter @Too...
2 @Sports_head I wish the #Tokyo2020 @Olympics w...
3 I hate the attitude of officials at this olymp...
4 Why can't the #Athletics be more exciting? #To...
5 It is so much fun to see beautful colors at th...
The code was written in a Jupyter notebook. I would appreciate any suggestions as to what exactly has gone wrong. Thank you.
CodePudding user response:
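According to your error message, AttributeError: 'Series' object has no attribute 'split', you're trying to call the string method split() on a pandas Series object. Because text_df is a DataFrame, tweet.loc[j] returns a whole row (a Series) rather than a string. The failing line is: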
wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
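To make the cause concrete, here is a minimal sketch of what .loc returns on a one-column DataFrame compared with the column itself. The two tweets are shortened from the question's test data, and the names row and cell are only illustrative.

```python
import pandas as pd

# Two tweets shortened from the question's test data (illustrative only)
text = ["@Sports_head I wish the #Tokyo2020 @Olympics will never end",
        "Why can't the #Athletics be more exciting? #Tokyo2020"]

text_df = pd.DataFrame(text)   # a single column, labelled 0

row = text_df.loc[0]           # selects a whole row, which is a Series
cell = text_df[0].loc[0]       # selects column 0 first, then row 0: a plain str

print(type(row).__name__)      # Series, so .split() is not available
print(type(cell).__name__)     # str
print(len(cell.split()))       # word count of the first tweet
```

Calling row.split() raises the same AttributeError as in the question, while cell.split() works because cell is an ordinary string.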
By looking at the test data you provided, you could do something like this in order to fix the error:
import pandas as pd
text_df = pd.DataFrame(text,columns=["tweet"])
text_df.tweet.loc[0].split()
This returns:
['@Harrison2Jennifer',
'Tokyo',
'2020',
'is',
'so',
'much',
'fun.',
'Loving',
'every',
'bit',
'of',
'it',
'just',
'as',
'@MeggyJane',
'&',
'@Tommy620',
'say',
'#Tokyo2020',
'https://www.corp.com']
Alternatively, you can skip pandas for the input by passing the "raw" list of tweets and changing the list comprehension to:
wordcount = [len(t.split()) for t in tweet]
CodePudding user response:
You are creating a pd.DataFrame, but it has only a single column. In your case this column is called 0.
So you can fix your code by either:
- Passing that column (a Series) to the function instead of the whole DataFrame:
tweetFeatures(tweet = text_df[0])
- Creating a Series instead of a DataFrame:
text_df = pd.Series(text)
and calling the function as you do now.
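As a rough check that both fixes behave the same, here is a cut-down sketch. The function word_and_mention_counts is a hypothetical two-feature stand-in for tweetFeatures (it keeps the question's .loc[j] indexing), and the two tweets are taken from the question's test data.

```python
import re
import pandas as pd

def word_and_mention_counts(tweet):
    """Hypothetical, trimmed-down stand-in for tweetFeatures; expects a Series of strings."""
    n_words = [len(tweet.loc[j].split()) for j in range(len(tweet))]
    n_mentions = [len(re.findall(r"@([a-zA-Z0-9_]{1,50})", tweet.loc[j]))
                  for j in range(len(tweet))]
    return pd.DataFrame({"n_words": n_words, "n_mentions": n_mentions})

text = ["Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
        "Why can't the #Athletics be more exciting? #Tokyo2020"]

# Fix 1: keep the DataFrame, but pass its only column (a Series)
text_df = pd.DataFrame(text)
from_column = word_and_mention_counts(text_df[0])

# Fix 2: build a Series in the first place
from_series = word_and_mention_counts(pd.Series(text))

print(from_column)
print(from_column.equals(from_series))
```

Both calls work because, on a Series with the default integer index, tweet.loc[j] returns the string stored at position j.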
Additionally, you can speed up your function by using apply in most cases. Note that this is a bit slower for small inputs such as the sample you provided, but it gives a significant speed-up with more tweets:
text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
"Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
"@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
"I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
"Why can't the #Athletics be more exciting? #Tokyo2020",
"It is so much fun to see beautful colors at the #Olympics"]*1000
import re
from functools import partial

import pandas as pd

def tweetFeatures_speedup(tweet):
    #Count the number of words in each tweet
    wordcount = tweet.apply(lambda x: len(x.split()))
    #Count the number of characters in each tweet
    chars = tweet.apply(len)
    #Find the mentions in each tweet and count them
    mention_finder = partial(re.findall, r"@([a-zA-Z0-9_]{1,50})")
    mention_count = tweet.apply(lambda x: len(mention_finder(x)))
    #Find the hashtags in each tweet and count them
    hashtag_finder = partial(re.findall, r"#([a-zA-Z0-9_]{1,50})")
    hashtag_count = tweet.apply(lambda x: len(hashtag_finder(x)))
    #Find the urls in each tweet and count them
    url_finder = partial(re.findall, r"(?P<url>https?://[^\s]+)")
    url_count = tweet.apply(lambda x: len(url_finder(x)))
    #Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)
    return feats_df
This results in the following %%timeit comparison:
- Your version: 193 ms ± 1.95 ms per loop
- My version: 21.3 ms ± 85.1 µs
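As a further variation, which is not part of the timings above, the same counts can be written with pandas' .str accessor instead of apply. This is only a sketch: the name tweet_features_str is made up here, the patterns drop the capture groups because only counts are needed, and I have not timed it against the versions above.

```python
import pandas as pd

def tweet_features_str(tweet: pd.Series) -> pd.DataFrame:
    """Hypothetical variant that uses the .str accessor; expects a Series of strings."""
    return pd.DataFrame({
        "n_words": tweet.str.split().str.len(),                  # whitespace-separated tokens
        "n_chars": tweet.str.len(),                              # characters per tweet
        "n_mentions": tweet.str.count(r"@[a-zA-Z0-9_]{1,50}"),   # regex matches per tweet
        "n_hashtag": tweet.str.count(r"#[a-zA-Z0-9_]{1,50}"),
        "n_url": tweet.str.count(r"https?://\S+"),
    })

# Two tweets from the question's test data
text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
        "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints"]

feats = tweet_features_str(pd.Series(text))
print(feats)
```

Series.str.count counts the non-overlapping regex matches in each string, so it should agree with len(re.findall(...)) for these patterns.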