I'm writing a function to filter tweet data down to the rows that contain a search word. Here's my code:
def twitter_filter(df, search):
    coun = 0
    date_ls = []
    id_ls = []
    content_ls = []
    lan_ls = []
    name_ls = []
    retweet_ls = []
    cleaned_tweet_ls = []
    for i, row in df.iterrows():
        if search in row.cleaned_tweet:
            date_ls.append(row.date)
            id_ls.append(row.id)
            content_ls.append(row.content)
            lan_ls.append(row.language)
            name_ls.append(row.name)
            retweet_ls.append(row.retweet)
            cleaned_tweet_ls.append(row.cleaned_tweet)
    new_dict = {
        "date": date_ls,
        "id": id_ls,
        "content": content_ls,
        "lan" : lan_ls,
        "name" : name_ls,
        "retweet" : retweet_ls,
        "cleaned_tweeet": cleaned_tweet_ls,
    }
    new_df = pd.DataFrame(new_dict)
    return new_df
Before filtering:
cleandf['name']
Out[6]:
0            PryZmRuleZZ
1         Arbitration111
2                4kjweed
3         THEREALCAMOJOE
5              DailyBSC_
               ...
130997     Rabbitdogebsc
130999          gmtowner
131000    topcryptostats
131001     vGhostvRiderv
131002          gmtowner
Name: name, Length: 98177, dtype: object
After filtering, the user names become seemingly random integers:
cleanedogetweet['name']
Out[7]:
0             3
1             5
2             9
3            12
4            34
          ...
80779    130997
80780    130999
80781    131000
80782    131001
80783    131002
Name: name, Length: 80784, dtype: int64
This problem only happens in the name column; the other columns that contain strings are fine.
I expected the original user names to be preserved. How can I solve this?
Answer:
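When you iterate with df.iterrows(), each row is a pandas Series, and every Series has an attribute called name. For a row, that attribute holds the row's index label, so row.name returns the index of the row rather than the value in your name column. That is why the filtered name column contains integers taken from the original index, while the other columns are unaffected.
It is therefore better not to call a column name, because attribute access will always resolve to the Series attribute instead of the column. You can use the rename method to give the column another name, such as username, or you can change your function to use bracket access (row['name']) like this: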
import pandas as pd

def twitter_filter(df, search):
    date_ls = []
    id_ls = []
    content_ls = []
    lan_ls = []
    name_ls = []
    retweet_ls = []
    cleaned_tweet_ls = []
    for i, row in df.iterrows():
        if search in row['cleaned_tweet']:
            # Bracket access always reads the column value.
            # row.name would return the row's index label instead.
            date_ls.append(row['date'])
            id_ls.append(row['id'])
            content_ls.append(row['content'])
            lan_ls.append(row['language'])
            name_ls.append(row['name'])
            retweet_ls.append(row['retweet'])
            cleaned_tweet_ls.append(row['cleaned_tweet'])
    new_dict = {
        "date": date_ls,
        "id": id_ls,
        "content": content_ls,
        "lan": lan_ls,
        "user_name": name_ls,
        "retweet": retweet_ls,
        "cleaned_tweet": cleaned_tweet_ls,  # key typo "cleaned_tweeet" fixed
    }
    new_df = pd.DataFrame(new_dict)
    return new_df
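As a side note, the conflict can be reproduced on a tiny frame, and the same filter can probably be written without a loop. The sketch below is only an illustration: the sample data and the function name twitter_filter_mask are made up for this example, and it assumes the search word should be matched as a literal, case-sensitive substring, as `search in row.cleaned_tweet` does in the question.

```python
import pandas as pd

# Hypothetical sample data with gaps in the index, similar to the question's frame.
df = pd.DataFrame(
    {
        "name": ["PryZmRuleZZ", "4kjweed", "DailyBSC_"],
        "cleaned_tweet": ["doge to the moon", "bitcoin dips", "doge army"],
    },
    index=[0, 2, 5],
)

# Attribute access resolves to Series.name, which is the row's index label.
attr_values = [row.name for _, row in df.iterrows()]
# Bracket access reads the value stored in the "name" column.
item_values = [row["name"] for _, row in df.iterrows()]


def twitter_filter_mask(df, search):
    """Keep rows whose cleaned_tweet contains `search` as a literal substring."""
    # regex=False treats `search` as plain text; na=False skips missing tweets.
    mask = df["cleaned_tweet"].str.contains(search, regex=False, na=False)
    # reset_index(drop=True) renumbers the rows from 0, like building a new frame.
    return df[mask].reset_index(drop=True)


filtered = twitter_filter_mask(df, "doge")
print(attr_values)
print(item_values)
print(filtered["name"].tolist())
```

Because the mask version never touches row attributes, the name column keeps its values. Unlike the loop version, it returns all original columns under their original names (for example language instead of lan).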