My practical case is the following, in a store I would like to know visit period (in day) between the 1st visit and the 2nd visit, the 2nd and the 3rd,...
I have a python dataset with 2 columns (visit IDs for each customer and a date of visit)
data = {'Id': ['A', 'B','A','B','A','A'],
'Date': ['01/03/2022', '03/03/2022', '05/03/2022', '07/03/2022', '09/03/2022','11/03/2022']
}
My question : How many days are there between the 1st visit and the 2nd visit for customers who have come 4 times? same question between the 2nd and the 3rd visit...
CodePudding user response:
You can first cast them to pd.Timestamp
objects by
df['Date'] = pd.to_datetime(df['Date'])
Then, you can simply:
df['Date Since Last Visit'] = df['Date'] - df['Date'].shift(1)
Which will give you the timedelta objects that you can see how many days it has been. The .shift()
moves the date column down by one so you can easily compute the difference between the two pd.Timestamp
objects.
Your output will be:
Id Date Date Since Last Visit
0 A 2022-01-03 NaT
1 B 2022-03-03 59 days
2 A 2022-05-03 61 days
3 B 2022-07-03 61 days
4 A 2022-09-03 62 days
5 A 2022-11-03 61 days
NaT
for the first since there is no previous entry to compare.
CodePudding user response:
The output you expect is unclear, but let's compute a 2D table with the customers as index and the visits as columns:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
g = df.groupby('Id')['Date']
df2 = (df
.assign(visit=g.cumcount().add(1),
nb_visits=g.transform('count'),
diff=g.diff()
)
.query('nb_visits >= 4')
.pivot(index='Id', columns='visit', values='diff')
)
output:
visit 1 2 3 4
Id
A NaT 4 days 4 days 2 days