Home > database >  pandas with assign loc & pd.Timedelta --> mean
pandas with assign loc & pd.Timedelta --> mean

Time:04-19

I have a dataframe with 3 columns (id, date1,date2)

data = [['C', '05/06/2021','07/09/2021'],['A', '15/04/2021','08/09/2021'],['A','15/10/2021','09/12/2021'],['C', '03/07/2021','10/09/2021'],['C', '13/07/2021','11/09/2021'],['C', '25/10/2021','12/12/2021'],['C', '26/09/2021','07/12/2021'],['C', '10/08/2021','07/12/2021'],['C', '28/07/2021','13/12/2021'],['A', '15/05/2021','13/12/2021'], ['C', '13/06/2021','13/12/2021'],['A', '17/05/2021','13/12/2021'],['C', '27/06/2021','13/12/2021'], ['B', '18/06/2021','13/12/2021']]

df_test = pd.DataFrame(data, columns = ['id', 'date1', 'date2'])
df_test['date1'] = pd.to_datetime(df_test['date1'],dayfirst=True)
df_test['date2'] = pd.to_datetime(df_test['date2'],dayfirst=True)

I want to calculate difference between date2-date1 and calculate means based on days filter >=100 I have 2 methods the first one works but not the second doesn't... How could i fix it ?

The first one works

df_final=(df_test
.sort_values(by='id')
.assign(diffe=df_test['date2']- df_test['date1']
)   
)

and

test=df_final.loc[df_final['diffe']>=pd.Timedelta(100, 'D')]
test['diffe'].mean()

Second method

df_final=(df_test
    .sort_values(by='id') # Classe les numeros sont dans l'ordre
    .assign(diffe=df_test['date2']- df_test['date1']
    )
    .loc[df_reservation_delay['diffe']>=pd.Timedelta(100, 'D')]
    .mean()
    )

I have an error (KeyError: 'diffe') ? Do you have an idea ?

Have a nice day

CodePudding user response:

You need to use a callable as your column does not exist yet.

Also, better explicitly provide the column name in loc to avoid a FutureWarning

FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.

df_final=(df_test
    .sort_values(by='id') # Classe les numeros sont dans l'ordre
    .assign(diffe=df_test['date2']- df_test['date1']
    )
    .loc[lambda d: d['diffe']>=pd.Timedelta(100, 'D'), 'diffe']
    .mean()
    )

output: Timedelta('169 days 09:00:00')

  • Related