I have a table in which the transaction data of customers are stored. The purchase date is always given at the first of the month. If a customer has bought more than once during a month, both purchase invoices are issued on the same date.
Date | CustomerID |
---|---|
01.02.20 | 1 |
01.03.20 | 1 |
01.03.20 | 1 |
01.03.20 | 1 |
For my further calculations, I would now like to remove the "duplicates" by increasing the date by one day for repeated purchases in a month. How can I do this to get the desired result with python?
Date | CustomerID |
---|---|
01.02.20 | 1 |
01.03.20 | 1 |
02.03.20 | 1 |
03.03.20 | 1 |
CodePudding user response:
A way to do it is to group by the dates and then convert them to a date range based on the number of occurrences of a certain customer ID.
First, we'd like the dates to be represented by pandas internally as actual dates and not strings.
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%y")
Next, we group the data by customer ID and dates. In your case there's only the year 2020 so grouping by month only is alright but we'd like to be future proof so we shall also group by year. dt
refers to "datetime" and fetches the relevant info for us.
gb = df.groupby(["CustomerID", df.Date.dt.year, df.Date.dt.month])
Then, we can use some pandas magic proper. For each group, we get the first date which will always be the first day of a given month. We create a range of days, starting from that day, that is proportional to the counts of observations in that group and voilà.
idx = gb["Date"].transform(lambda x : pd.date_range(x.iloc[0],
periods=x.count(),
freq="D"))
idx
looks like this :
0 2020-02-01
1 2020-03-01
2 2020-03-02
3 2020-03-03
Name: Date, dtype: datetime64[ns]
Finally, we know there are as many values in newly created idx
as there are in the original dataframe, meaning that we can either use this variable as and index or as new column.
(Pick one of the two depending on your needs)
df.index = idx # as index
df["Date_fixed"] = idx # as column
N.B: This solution might present issues in case there are more customer ID's for a given month than days in the month. So you'll want to be careful with that =)