I have a dataframe (df) with the following structure:
retweet_datetime | tweet_id | tweet_datetime |
---|---|---|
2020-04-24 03:33:15 | 85053699 | 2020-04-24 02:28:22 |
2020-04-24 02:43:35 | 85053699 | 2020-04-24 02:28:22 |
2020-04-18 04:24:03 | 86095361 | 2020-04-18 00:06:01 |
2020-04-18 00:19:08 | 86095361 | 2020-04-18 00:06:01 |
2020-04-18 00:18:38 | 86095361 | 2020-04-18 00:06:01 |
2020-04-18 00:07:08 | 86095361 | 2020-04-18 00:06:01 |
The retweet_datetime is sorted from latest to newest retweets.
I'd like to create two new columns as follows:
tweet_lifetime1
: the difference between the last retweet time and the first retweet time, i.e., for each tweet_id: last retweet_datetime - first retweet_datetimetweet_lifetime2
: the difference between the last retweet time and tweet creation time (tweet_datetime
)
UPDATE:
For example, for the tweet id: "86095361":
tweet_lifetime1
=2020-04-18 04:24:03 - 2020-04-18 00:07:08
(04:16:55
)tweet_lifetime2
=2020-04-18 04:24:03 - 2020-04-18 00:06:01
(04:18:02
)
The expected output df:
retweet_datetime | tweet_id | tweet_datetime | lifetime1 | lifetime2 |
---|---|---|---|---|
2020-04-24 03:33:15 | 85053699 | 2020-04-24 02:28:22 | 00:49:40 | 01:04:53 |
2020-04-18 04:24:03 | 86095361 | 2020-04-18 00:06:01 | 04:16:55 | 04:18:02 |
I've seen several similar posts, but they mostly subtract consecutive rows. For example, I can subtract the time difference between each retweet_datetimes for each tweet id as follows:
df2 = df.assign(delta = df.groupby('tweet_id')['retweet_datetime'].diff())
Can someone help me with this? Thank you,
CodePudding user response:
Groupby and agg with first
to select the first row per group, then groupby and agg retweet_datetime
with last
to select last value per group. Then join
aggregated dataframes and calculate lifetime1
and lifetime2
by subtracting relevant columns
g = df.groupby('tweet_id')
s = g.first().join(g['retweet_datetime'].agg(['last']))
s['lifetime1'] = s['retweet_datetime'] - s.pop('last')
s['lifetime2'] = s['retweet_datetime'] - s['tweet_datetime']
retweet_datetime tweet_datetime lifetime1 lifetime2
tweet_id
85053699 2020-04-24 03:33:15 2020-04-24 02:28:22 0 days 00:49:40 0 days 01:04:53
86095361 2020-04-18 04:24:03 2020-04-18 00:06:01 0 days 04:16:55 0 days 04:18:02
CodePudding user response:
Use named aggregation with subtract column with Series.sub
, DataFrame.pop
is used for drop column tmp
after processing:
df1 = (df.groupby('tweet_id', as_index=False)
.agg(retweet_datetime=('retweet_datetime','first'),
tmp = ('retweet_datetime','last'),
tweet_datetime = ('tweet_datetime','last')))
df1['lifetime1'] = df1['retweet_datetime'].sub(df1.pop('tmp'))
df1['lifetime2'] = df1['retweet_datetime'].sub(df1['tweet_datetime'])
print (df1)
tweet_id retweet_datetime tweet_datetime lifetime1 \
0 85053699 2020-04-24 03:33:15 2020-04-24 02:28:22 0 days 00:49:40
1 86095361 2020-04-18 04:24:03 2020-04-18 00:06:01 0 days 04:16:55
lifetime2
0 0 days 01:04:53
1 0 days 04:18:02
If need format HH:MM:SS
use:
def f(x):
ts = x.total_seconds()
hours, remainder = divmod(ts, 3600)
minutes, seconds = divmod(remainder, 60)
return ('{:02d}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
df1['lifetime1'] = df1['retweet_datetime'].sub(df1.pop('tmp')).apply(f)
df1['lifetime2'] = df1['retweet_datetime'].sub(df1['tweet_datetime']).apply(f)
print (df1)
tweet_id retweet_datetime tweet_datetime lifetime1 lifetime2
0 85053699 2020-04-24 03:33:15 2020-04-24 02:28:22 00:49:40 01:04:53
1 86095361 2020-04-18 04:24:03 2020-04-18 00:06:01 04:16:55 04:18:02