Python Pandas - Difference between groupby keys with repeated valyes-CodePudding

I have some data with dates of sales to my clients. The data looks like this:

	Cod client	Items	Date
0	100	1	2022/01/01
1	100	7	2022/01/01
2	100	2	2022/02/01
3	101	5	2022/01/01
4	101	8	2022/02/01
5	101	10	2022/02/01
6	101	2	2022/04/01
7	101	2	2022/04/01
8	102	4	2022/02/01
9	102	10	2022/03/01

What I'm trying to acomplish is to calculate the differences beetween dates for each client: grouped first by "Cod client" and after by "Date" (because of the duplicates)

The expected result is like:

	Cod client	Items	Date	Date diff	Explain
0	100	1	2022/01/01	NaT	First date for client 100
1	100	7	2022/01/01	NaT	...repeat above
2	100	2	2022/02/01	31	Diff from first date 2022/01/01
3	101	5	2022/01/01	NaT	Fist date for client 101
4	101	8	2022/02/01	31	Diff from first date 2022/01/01
5	101	10	2022/02/01	31	...repeat above
6	101	2	2022/04/01	59	Diff from previous date 2022/02/01
7	101	2	2022/04/01	59	...repeat above
8	102	4	2022/02/01	NaT	First date for client 102
9	102	10	2022/03/01	28	Diff from first date 2022/02/01

I already tried doing df["Date diff"] = df.groupby("Cod client")["Date"].diff() but it considers the repeated dates and return zeroes for then

I appreciate for help!

CodePudding user response：

Another way to do this, with transform:

import pandas as pd
# data saved as .csv
df = pd.read_csv("Data.csv", header=0, parse_dates=True)
# convert Date column to correct date.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
# new column!
df["Date diff"] = df.sort_values("Date").groupby("Cod client")["Date"].transform(lambda x: x.diff().replace("0 days", pd.NaT).ffill())

CodePudding user response：

IIUC you can combine several groupby operations:

# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])

# set up group
g = df.groupby('Cod client')

# identify duplicated dates per group
m = g['Date'].apply(pd.Series.duplicated)

# compute the diff, mask and ffill
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()

output:

   Cod client  Items       Date Date diff
0         100      1 2022-01-01       NaT
1         100      7 2022-01-01       NaT
2         100      2 2022-02-01   31 days
3         101      5 2022-01-01       NaT
4         101      8 2022-02-01   31 days
5         101     10 2022-02-01   31 days
6         101      2 2022-04-01   59 days
7         101      2 2022-04-01   59 days
8         102      4 2022-02-01       NaT
9         102     10 2022-03-01   28 days