Loop text data based on column value in data frame in python

Time:10-01

I have a dataset called data_set_tweets.csv, shown below:

created_at,tweet,retweet_count
7/29/2021 2:40,Great Sunny day for Cricket at London,3
7/29/2021 10:40,Great Score put on by England batting,0
7/29/2021 11:50,England won the match,1

I am trying to get the output below into a data frame.
That is, I want to repeat each row of the tweet column based on its retweet_count value (one row for the tweet itself plus one per retweet), keeping the same created_at value on every copy of that tweet.
Below is the expected output for my dataset:

created_at,tweet
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 2:40,Great Sunny day for Cricket at London
7/29/2021 10:40,Great Score put on by England batting
7/29/2021 11:50,England won the match
7/29/2021 11:50,England won the match


Below is how I started my approach
import pandas as pd

def iterateTweets():
    tweets = pd.read_csv(r'data_set_tweets.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet', 'retweet_count'])
    df['created_at'] = pd.to_datetime(df['created_at'])
    df['tweet'] = df['tweet'].apply(lambda x: str(x))
    df['retweet_count'] = df['retweet_count'].apply(lambda x: str(x))

    # print(df)
    return df

if __name__ == '__main__':
    print(iterateTweets())

I'm a beginner with data frames and Python. Can someone help me out?

CodePudding user response:

Use Index.repeat with DataFrame.loc to duplicate the rows. DataFrame.pop returns the retweet_count column and drops it from the frame in one step:

df = pd.read_csv(r'data_set_tweets.csv')

df['created_at'] = pd.to_datetime(df['created_at'])
df = df.loc[df.index.repeat(df.pop('retweet_count') + 1)].reset_index(drop=True)
print(df)
           created_at                                  tweet
0 2021-07-29 02:40:00  Great Sunny day for Cricket at London
1 2021-07-29 02:40:00  Great Sunny day for Cricket at London
2 2021-07-29 02:40:00  Great Sunny day for Cricket at London
3 2021-07-29 02:40:00  Great Sunny day for Cricket at London
4 2021-07-29 10:40:00  Great Score put on by England batting
5 2021-07-29 11:50:00                  England won the match
6 2021-07-29 11:50:00                  England won the match
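
For anyone who wants to try the Index.repeat approach above without the CSV file, here is a minimal self-contained sketch. It assumes the three sample rows from the question, loaded from an in-memory string; the helper name expand_by_retweets and the CSV_TEXT constant are made up for illustration. Note that the count is kept as an integer: converting retweet_count to str, as the attempt in the question does, would make the + 1 fail.

```python
import io

import pandas as pd

# Sample rows copied from the question; with the real file this would be
# pd.read_csv('data_set_tweets.csv') instead.
CSV_TEXT = """created_at,tweet,retweet_count
7/29/2021 2:40,Great Sunny day for Cricket at London,3
7/29/2021 10:40,Great Score put on by England batting,0
7/29/2021 11:50,England won the match,1
"""


def expand_by_retweets(df):
    """Return each tweet once, plus one extra copy per retweet (hypothetical helper)."""
    out = df.copy()  # leave the caller's frame untouched
    # pop() removes the column from `out` and hands it back as a Series
    counts = out.pop('retweet_count').astype(int).to_numpy() + 1
    # repeat each index label `counts` times, then select those labels
    return out.loc[out.index.repeat(counts)].reset_index(drop=True)


df = pd.read_csv(io.StringIO(CSV_TEXT))
df['created_at'] = pd.to_datetime(df['created_at'])
expanded = expand_by_retweets(df)
print(expanded)
```

Working on a copy inside the helper is a design choice for this sketch, so that the original frame still has its retweet_count column afterwards.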

CodePudding user response:

Or use:

df = df.apply(lambda x: x.repeat(df['retweet_count'] + 1)).reset_index(drop=True)

If you want to remove the retweet_count column:

df = df.apply(lambda x: x.repeat(df['retweet_count'] + 1)).reset_index(drop=True).drop('retweet_count', axis=1)

Or:

col = df.pop('retweet_count') + 1
df = df.apply(lambda x: x.repeat(col)).reset_index(drop=True)

df output:

           created_at                                  tweet
0 2021-07-29 02:40:00  Great Sunny day for Cricket at London
1 2021-07-29 02:40:00  Great Sunny day for Cricket at London
2 2021-07-29 02:40:00  Great Sunny day for Cricket at London
3 2021-07-29 02:40:00  Great Sunny day for Cricket at London
4 2021-07-29 10:40:00  Great Score put on by England batting
5 2021-07-29 11:50:00                  England won the match
6 2021-07-29 11:50:00                  England won the match

Or use loc with enumerate:

df.loc[sum([[i] * (v + 1) for i, v in enumerate(df['retweet_count'])], [])].reset_index(drop=True)
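
One caveat with the enumerate version: enumerate yields positions (0, 1, 2, ...), while loc looks rows up by label, so the two only agree when the frame has the default integer index. A possible position-based variant, sketched here with numpy.repeat and iloc on a small made-up frame (the short tweet strings and the 'x', 'y', 'z' labels are placeholders, not data from the question):

```python
import numpy as np
import pandas as pd

# Toy frame with non-default index labels, chosen only to illustrate the point
df = pd.DataFrame(
    {'tweet': ['a', 'b', 'c'], 'retweet_count': [3, 0, 1]},
    index=['x', 'y', 'z'],
)

# Row position i is repeated retweet_count[i] + 1 times
positions = np.repeat(np.arange(len(df)), df['retweet_count'].to_numpy() + 1)

# iloc selects by position, so the index labels do not matter
result = df.iloc[positions].reset_index(drop=True)
print(result)
```

This also avoids building the position list with sum(..., []), which copies the list on every step.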