comparing rows data frame | shift and apply functions throwing exception-CodePudding

I am trying to derive a mean value for the average duration spent in a specific status by ID. For this I first sort my data frame by ID and date, and with the apply and shift function trying to deduct the date of row[i 1] - row[i] - given row[i 1] - row[i] are for the same ID.

I get the following exception: AttributeError: 'int' object has no attribute 'shift'

Below a code for simulation:

import datetime
from datetime import datetime
today =  datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['duration'] = frame_ordered.apply(lambda x: x['date'].shift(-1) - x['date'] if x['id'] == x['id'].shift(-1) else today - x['date'], axis=1)

Can anyone please advise how to solve the last line with the lambda function?

CodePudding user response：

I was not able to get it done with lambda. You can try like this:

import datetime
today =  datetime.datetime.today() # you want it as real date, not string
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date']) #convert date column to datetime
frame_ordered = frame.sort_values(['id','date'], ascending=True)

#add column with shifted date values
frame_ordered['shifted'] = frame_ordered['date'].shift(-1)

# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
print(mask)

# subtract date and shifted date if mask is true, otherwise subtract date from today. ".dt.days" only displays the days, not necessary 
frame_ordered['duration'] = np.where(mask, (frame_ordered['shifted']-frame_ordered['date']).dt.days, (today-frame_ordered['date']).dt.days)

#delete shifted date column if you want
frame_ordered = frame_ordered.drop('shifted', axis=1)

print(frame_ordered)

Output:

#mask
0    False
4    False
2    False
3     True
1    False
Name: id, dtype: bool

#frame_ordered
     id  status       date  duration
0  1245       1 2022-07-01      25.0
4  1248       6 2022-01-03     204.0
2  2345       4 2022-04-20      97.0
3  4556       5 2022-02-02      38.0
1  4556       2 2022-03-12     136.0

CodePudding user response：

I think that the values were not interpreted as pandas Timestamps. With the right conversion it should be easy though:

import datetime
from datetime import datetime
today =  datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['shifted'] = frame_ordered['date'].shift(1)
frame_ordered['Difference'] = frame_ordered['date']-frame_ordered['date'].shift(1)
print(frame_ordered)

which prints out

     id  status       date    shifted Difference
0  1245       1 2022-07-01        NaT        NaT
4  1248       6 2022-01-03 2022-07-01  -179 days
2  2345       4 2022-04-20 2022-01-03   107 days
3  4556       5 2022-02-02 2022-04-20   -77 days
1  4556       2 2022-03-12 2022-02-02    38 days