I am trying to derive a mean value for the average duration spent in a specific status by ID. For this I first sort my data frame by ID and date, and with the apply and shift function trying to deduct the date of row[i 1] - row[i] - given row[i 1] - row[i] are for the same ID.
I get the following exception: AttributeError: 'int' object has no attribute 'shift'
Below a code for simulation:
import datetime
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['duration'] = frame_ordered.apply(lambda x: x['date'].shift(-1) - x['date'] if x['id'] == x['id'].shift(-1) else today - x['date'], axis=1)
Can anyone please advise how to solve the last line with the lambda function?
CodePudding user response:
I was not able to get it done with lambda. You can try like this:
import datetime
today = datetime.datetime.today() # you want it as real date, not string
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date']) #convert date column to datetime
frame_ordered = frame.sort_values(['id','date'], ascending=True)
#add column with shifted date values
frame_ordered['shifted'] = frame_ordered['date'].shift(-1)
# mask where the next row has same id as current one
mask = frame_ordered['id'] == frame_ordered['id'].shift(-1)
print(mask)
# subtract date and shifted date if mask is true, otherwise subtract date from today. ".dt.days" only displays the days, not necessary
frame_ordered['duration'] = np.where(mask, (frame_ordered['shifted']-frame_ordered['date']).dt.days, (today-frame_ordered['date']).dt.days)
#delete shifted date column if you want
frame_ordered = frame_ordered.drop('shifted', axis=1)
print(frame_ordered)
Output:
#mask
0 False
4 False
2 False
3 True
1 False
Name: id, dtype: bool
#frame_ordered
id status date duration
0 1245 1 2022-07-01 25.0
4 1248 6 2022-01-03 204.0
2 2345 4 2022-04-20 97.0
3 4556 5 2022-02-02 38.0
1 4556 2 2022-03-12 136.0
CodePudding user response:
I think that the values were not interpreted as pandas Timestamps. With the right conversion it should be easy though:
import datetime
from datetime import datetime
today = datetime.today().strftime('%Y-%m-%d')
frame = pd.DataFrame({'id': [1245, 4556, 2345, 4556, 1248],'status': [1,2,4,5,6], 'date': ['2022-07-01', '2022-03-12', '2022-04-20', '2022-02-02', '2022-01-03']})
frame['date'] = pd.to_datetime(frame['date'])
frame_ordered = frame.sort_values(['id','date'], ascending=True)
frame_ordered['shifted'] = frame_ordered['date'].shift(1)
frame_ordered['Difference'] = frame_ordered['date']-frame_ordered['date'].shift(1)
print(frame_ordered)
which prints out
id status date shifted Difference
0 1245 1 2022-07-01 NaT NaT
4 1248 6 2022-01-03 2022-07-01 -179 days
2 2345 4 2022-04-20 2022-01-03 107 days
3 4556 5 2022-02-02 2022-04-20 -77 days
1 4556 2 2022-03-12 2022-02-02 38 days