Home > OS >  difference between multiple dates based on id column in a Pandas dataframe
difference between multiple dates based on id column in a Pandas dataframe

Time:04-15

My practical case is the following, in a store I would like to know visit period (in day) between the 1st visit and the 2nd visit, the 2nd and the 3rd,...

I have a python dataset with 2 columns (visit IDs for each customer and a date of visit)

data = {'Id': ['A', 'B','A','B','A','A'],
'Date': ['01/03/2022', '03/03/2022', '05/03/2022', '07/03/2022', '09/03/2022','11/03/2022']
}

enter image description here

My question : How many days are there between the 1st visit and the 2nd visit for customers who have come 4 times? same question between the 2nd and the 3rd visit...

CodePudding user response:

You can first cast them to pd.Timestamp objects by

df['Date'] = pd.to_datetime(df['Date'])

Then, you can simply:

df['Date Since Last Visit'] = df['Date'] - df['Date'].shift(1)

Which will give you the timedelta objects that you can see how many days it has been. The .shift() moves the date column down by one so you can easily compute the difference between the two pd.Timestamp objects.

Your output will be:

    Id  Date    Date Since Last Visit
0   A   2022-01-03  NaT
1   B   2022-03-03  59 days
2   A   2022-05-03  61 days
3   B   2022-07-03  61 days
4   A   2022-09-03  62 days
5   A   2022-11-03  61 days

NaT for the first since there is no previous entry to compare.

CodePudding user response:

The output you expect is unclear, but let's compute a 2D table with the customers as index and the visits as columns:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
g = df.groupby('Id')['Date']

df2 = (df
 .assign(visit=g.cumcount().add(1),
         nb_visits=g.transform('count'),
         diff=g.diff()
         )
 .query('nb_visits >= 4')
 .pivot(index='Id', columns='visit', values='diff')
)

output:

visit   1      2      3      4
Id                            
A     NaT 4 days 4 days 2 days
  • Related