Home > other >  Identify pairs of events then calculate the time elapsed between events
Identify pairs of events then calculate the time elapsed between events

Time:10-05

I have a dataframe with messages sent and received. I want to calculate the time it took for someone to reply to the message.

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'sent':[78,18,94,55,68,57,78,8],
                   'received':[18,78,35,14,57,68,57,17],
                   'time':['2017-01-01T12','2017-01-01T13',
                           '2017-01-02T12','2017-02-01T13',
                           '2017-01-01T14','2017-01-01T15',
                           '2017-01-01T16','2017-01-01T17']})
df['time'] = pd.to_datetime(pd.Series(df['time']))

The method I thought of using was identifying pairs, so if sent =A and received =B, then there should be another entry with sent=B and received =A.

df["pairs"] = df.apply(lambda x: not df[(df["sent"] == x["received"]) & (df["received"] == x["sent"]) & (df.index != x.name)].empty, axis=1)

Then once I identify the pairs, I can calculate the time it took to respond with

sent_time = datetime.strptime('2017-01-01 12:00:00', fmt)
recieved_time = datetime.strptime('2017-01-01 13:00:00', fmt)

if sent_time > recieved_time:
    td = sent_time - recieved_time
else:
    td = recieved_time - sent_time

time = int(round(td.total_seconds() / 60))

I feel like I can do these separately, but I can't seem to put them together.


EDIT

So as for the output, I guess I would need a separate dataframe that would list the senders and the time it took for someone to reply to the e-mail.

so with the example,

the message was sent by 78 and it took 60 min to respond. then a message was sent by 68 and it took 60 min to respond

Sender time_to_respond
78 60
68 60

CodePudding user response:

#Sort row values to create unique group
df[['s','t']] = np.sort(df[['sent','received']], axis=1)

#Subset duplicated groups

    s = df[df.duplicated(subset=['s','t'], keep=False)]
    
    #Compute time difference between duplicated groups, drop duplicated rows and unwanted columns
    s=s.assign(time_to_respond=s.groupby(['s','t'])['time'].transform(lambda x:x.diff().bfill().dt.total_seconds()/60)).drop_duplicates(subset=['s','t'])[['sent','time_to_respond']]



   sent  time_to_respond
0    78             60.0
4    68             60.0

CodePudding user response:

A proposition with pandas.merge :

df =  (
        df.merge(df, left_on='sent',right_on='received',how='left')
          .assign(time_to_respond= lambda x: (x['time_y'] - x['time_x']).dt.total_seconds()/60)
      )


out = (
        df.loc[(df['time_to_respond'].gt(0)), ['sent_x', 'time_to_respond']]
          .rename(columns={'sent_x': 'sender'})
          .reset_index(drop=True)
      )

# Output :

print(out)

   sender  time_to_respond
0      78             60.0
1      68             60.0
2      57             60.0
  • Related