I have a dataset called data_set_tweets.csv, shown below:
created_at,tweet,retweet_count
7/29/2021 2:40,Great Sunny day for Cricket at London,3
7/29/2021 10:40,Great Score put on by England batting,0
7/29/2021 11:50,England won the match,1
I want to get the output below into a DataFrame.
That is, I want to repeat each row's tweet text according to its retweet_count value (so each tweet appears retweet_count + 1 times), keeping the same created_at value on every copy.
Below is the expected output for my dataset:
created_at,tweet
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 10:40,Great Score put on by England batting
7/29/2021 11:50,England won the match
7/29/2021 11:50,England won the match
Below is how I started my approach:
import pandas as pd

def iterateTweets():
    tweets = pd.read_csv(r'data_set_tweets.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet', 'retweet_count'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    df['retweet_count'] = df['retweet_count'].apply(lambda x: str(x))
    # print(df)
    return df

if __name__ == '__main__':
    print(iterateTweets())
I'm a beginner with DataFrames and Python. Can someone help me out?
CodePudding user response:
Use Index.repeat with DataFrame.loc to duplicate the rows. DataFrame.pop returns the column and drops it from the DataFrame in one step:
df = pd.read_csv(r'data_set_tweets.csv')
df['created_at'] = pd.to_datetime(df['created_at'])
df = df.loc[df.index.repeat(df.pop('retweet_count') + 1)].reset_index(drop=True)
print(df)
           created_at                                  tweet
0 2021-07-29 02:40:00  Great Sunny day for Cricket at London
1 2021-07-29 02:40:00  Great Sunny day for Cricket at London
2 2021-07-29 02:40:00  Great Sunny day for Cricket at London
3 2021-07-29 02:40:00  Great Sunny day for Cricket at London
4 2021-07-29 10:40:00  Great Score put on by England batting
5 2021-07-29 11:50:00                  England won the match
6 2021-07-29 11:50:00                  England won the match
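If you want to try this without the CSV file, here is a minimal self-contained sketch of the same Index.repeat idea. The helper name expand_by_retweets and the inline CSV_TEXT are only for illustration (they are not from the question), and the sketch assumes the DataFrame has a unique index, which is the case for the default index produced by read_csv.

```python
import io
import pandas as pd

# Inline copy of the sample rows from the question, so no CSV file is needed.
CSV_TEXT = """created_at,tweet,retweet_count
7/29/2021 2:40,Great Sunny day for Cricket at London,3
7/29/2021 10:40,Great Score put on by England batting,0
7/29/2021 11:50,England won the match,1
"""


def expand_by_retweets(df):
    """Return a new DataFrame where each row appears retweet_count + 1 times.

    Hypothetical helper for illustration; assumes a unique index.
    """
    out = df.copy()
    # pop() removes the column from `out` and returns it as a Series.
    repeats = out.pop("retweet_count").astype(int) + 1
    # Index.repeat duplicates each index label; .loc then selects by label.
    return out.loc[out.index.repeat(repeats)].reset_index(drop=True)


tweets = pd.read_csv(io.StringIO(CSV_TEXT))
tweets["created_at"] = pd.to_datetime(tweets["created_at"])
expanded = expand_by_retweets(tweets)
print(expanded)
```

Working on a copy keeps the original tweets DataFrame (including retweet_count) intact, which is handy if you still need the counts afterwards.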
CodePudding user response:
Or use:
df = df.apply(lambda x: x.repeat(df['retweet_count'] + 1)).reset_index(drop=True)
If you want to remove the retweet_count column:
df = df.apply(lambda x: x.repeat(df['retweet_count'] + 1)).reset_index(drop=True).drop('retweet_count', axis=1)
Or:
col = df.pop('retweet_count') + 1
df = df.apply(lambda x: x.repeat(col)).reset_index(drop=True)
df
Output:
           created_at                                  tweet
0 2021-07-29 02:40:00  Great Sunny day for Cricket at London
1 2021-07-29 02:40:00  Great Sunny day for Cricket at London
2 2021-07-29 02:40:00  Great Sunny day for Cricket at London
3 2021-07-29 02:40:00  Great Sunny day for Cricket at London
4 2021-07-29 10:40:00  Great Score put on by England batting
5 2021-07-29 11:50:00                  England won the match
6 2021-07-29 11:50:00                  England won the match
Or use loc with enumerate:
df.loc[sum([[i] * (v + 1) for i, v in enumerate(df['retweet_count'])], [])].reset_index(drop=True)
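To see what the enumerate version passes to loc, here is a small sketch using plain lists. The retweet_counts list is just the sample values from the question typed in by hand, and the variable names are for illustration only. Each row position i is repeated v + 1 times, and sum(..., []) flattens the nested lists into one list of positions.

```python
from itertools import chain

# retweet_count values from the sample data, written out by hand.
retweet_counts = [3, 0, 1]

# One inner list per row: the row position repeated retweet_count + 1 times.
nested = [[i] * (v + 1) for i, v in enumerate(retweet_counts)]
print(nested)  # [[0, 0, 0, 0], [1], [2, 2]]

# sum() with an empty-list start value concatenates the inner lists.
flat = sum(nested, [])
print(flat)  # [0, 0, 0, 0, 1, 2, 2]

# Same flattening with itertools.chain, which avoids building a new
# list at every concatenation step.
flat_chain = list(chain.from_iterable(nested))
print(flat_chain)  # [0, 0, 0, 0, 1, 2, 2]
```

Since enumerate yields positions 0, 1, 2, ..., passing this list to loc relies on the DataFrame having the default integer index. For larger frames the Index.repeat approach is likely the better choice, because summing lists gets slow as the number of rows grows.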