Home > Mobile >  How to substract two dates based on filter of two other columns
How to substract two dates based on filter of two other columns

Time:10-18

I am new in Python and I am struggling to reshape my dataFrame.

Here attached is a sample of the data I have. data sample

For a particular client (contact_id), I want to add an new date column that actually substracts the DTHR_OPERATION date for a 'TYPE_OPER_VALIDATION = 3' minus the DTHR_OPERATION date for a 'TYPE_OPER_VALIDATION = 1'.
If the 'TYPE_OPER_VALIDATION' is equal to 3 and that there is less than a hour difference between those two dates, I want to add a string such as 'connection' for example in the new column.

I have an issue "python Series' object has no attribute 'total_seconds" when I try to compare if the time difference is indeed minus or equal to an hour. I tried many solutions I found on Internet but I always seem to have a data type issue.

Here is my code snippet:

df_oper_one = merged_table.loc[(merged_table['TYPE_OPER_VALIDATION']==1),['contact_id','TYPE_OPER_VALIDATION','DTHR_OPERATION']]
df_oper_three = merged_table.loc[(merged_table['TYPE_OPER_VALIDATION']==3),['contact_id','TYPE_OPER_VALIDATION','DTHR_OPERATION']]

connection = []
for row in merged_table['contact_id']:
    if (df_validation.loc[(df_validation['TYPE_OPER_VALIDATION']==3)]) & ((pd.to_datetime(df_oper_three['DTHR_OPERATION'],format='%Y-%m-%d %H:%M:%S') - pd.to_datetime(df_oper_one['DTHR_OPERATION'],format='%Y-%m-%d %H:%M:%S').total_seconds()) <= 3600): connection.append('connection')
     # if diff_date.total_seconds() <= 3600: connection.append('connection')
    else:  connection.append('null')

merged_table['connection'] = pd.Series(connection)

CodePudding user response:

Hello Nicolas and welcome to Stack Overflow. Please remember to always include sample data to reproduce your issue. Here is sample data to reproduce part of your dataframe:

df = pd.DataFrame({'Id contact':['cf2e79bc-8cac-ec11-9840-000d3ab078e6']*12 ['865c5edf-c7ac-ec11-9840-000d3ab078e6']*10,
       'DTHR OPERATION':['11/10/2022 07:07', '11/10/2022 07:29', '11/10/2022 15:47', '11/10/2022 16:22', '11/10/2022 16:44', '11/10/2022 18:06', '12/10/2022 07:11', '12/10/2022 07:25', '12/10/2022 17:21', '12/10/2022 18:04', '13/10/2022 07:09', '13/10/2022 18:36', '14/09/2022 17:59', '15/09/2022 09:34', '15/09/2022 19:17', '16/09/2022 08:31', '16/09/2022 19:18', '17/09/2022 06:41', '17/09/2022 11:19', '17/09/2022 15:48', '17/09/2022 16:13', '17/09/2022 17:07'],
       'lastname':['BOUALAMI']*12 ['VERVOORT']*10,
       'TYPE_OPER_VALIDATION':[1, 3, 1, 3, 3, 3, 1, 3, 1, 3, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 3, 3]})

df['DTHR OPERATION'] = pd.to_datetime(df['DTHR OPERATION'])

I would recommend creating a new table to more easily accomplish your task:

df2 = pd.merge(df[['Id contact', 'DTHR OPERATION']][df['TYPE_OPER_VALIDATION']==3], df[['Id contact', 'DTHR OPERATION']][df['TYPE_OPER_VALIDATION']==1], on='Id contact', suffixes=('_type3','_type1'))

Then find the time difference:

df2['seconds'] = (df2['DTHR OPERATION_type3']-df2['DTHR OPERATION_type1']).dt.total_seconds()

Finally, flag connections of an hour or less:

df2['connection'] = np.where(df2['seconds']<=3600, 'yes', 'no')

Hope this helps!

CodePudding user response:

Thanks to your help I know have a df3['first_connection'] column that correctly flag them.

preview df3['first_connection']

Here is a preview of the result and the code snippet is just below :

df2 = df_support.merge(df_contact, left_on='bd_linkedcustomer', right_on='contact_id').merge(df_validation, left_on='bd_supportserialnumber',right_on='NUM_SERIE_SUPPORT')

df3 = pd.merge(df2[['contact_id', 'DTHR_OPERATION']][df2['TYPE_OPER_VALIDATION']==3], df2[['contact_id', 'DTHR_OPERATION']][df2['TYPE_OPER_VALIDATION']==1], on='contact_id', suffixes=('_type3','_type1'))

df3['seconds'] = (df3['DTHR_OPERATION_type3']-df3['DTHR_OPERATION_type1']).dt.total_seconds()

df3['first_connection'] = np.where((df3['seconds']<=3600) & (df3['seconds']>0), 'yes', 'no')

I repeated the process in a df4 where I wanted to flag the time difference where df2['TYPE_OPER_VALIDATION']==3 - df2['TYPE_OPER_VALIDATION']==3 to spot the second or more connection

Here is a preview of the result and the code snippet is just below :

preview df4['second_or_more_connection']

df4 = pd.merge(df2[['contact_id', 'DTHR_OPERATION']][df2['TYPE_OPER_VALIDATION']==3], df2[['contact_id', 'DTHR_OPERATION']][df2['TYPE_OPER_VALIDATION']==3], on='contact_id', suffixes=('_type3','_type3bis'))

df4['seconds_type3'] = (df4['DTHR_OPERATION_type3']-df4['DTHR_OPERATION_type3bis']).dt.total_seconds()

df4['second_or_more_connection'] = np.where((df4['seconds_type3']<=3600) & (df4['seconds_type3']>0), 'yes', 'no')

The thing is I now want to have a final connection column that paste the 'yes' value from the df3['first_connection'] column labeled as '1st connection', the 'yes' value from the df4['second_or_more_connection'] labeled as '2nd or more connection', the 1 value from the df2['TYPE_OPER_VALIDATION'] labeled as '1st journey' and the remaining empty value labeled as 'null'.

I struggle to do so as I am dealing with three different dataframes. I tried some nested for loops, some concatenation and so on but I don't know which option I should go for since I don't have expected results.

Also I end up having way too many values, I want to do an inner join at the end to only have the date field (df2[DTHR_OPERATION]) occuring once.

Can you help me out or guide me to do such data manipulation and transformation pls ? Many thanks for your support :)

  • Related