How can I make a new column in dataframe with a group label based on 7 day interval from start time-CodePudding

I have been looking online for a solution for this but I couldn't find one. Maybe it's too straightforward and I am not recognizing the solution but here is my question. I have a data frame that has rows representing transactions that took place on different days. The starting date of the period is 2022-01-01. I want to make a new column that would contain labels for which interval the transactions belong to. I want to create intervals of 7 days. So if the date of the transaction falls into the period of the first 7 days since the start date then those transactions would have the label "interval 1" if they fall in the next 7-day period then they would have the label "interval 2". Here is a dataframe as an example

# simulated data for stack overflow
data = {'dates':pd.date_range(start="2022-01-01",end="2022-01-21"),
       'values':rand(21)}
df_ex = pd.DataFrame(data)

I would like something like this

please note though that the actual date column does not have unique days. Some days have multiple transactions. I have tried some other stack overflow answers but they were not answering my specific question. I tried using some groupby functions and .diff() to no avail

CodePudding user response：

You can convert to_datetime, subtract the first date, get the number of days, and use the floordiv 1 to build your string:

s = pd.to_datetime(df_ex['dates'])
df_ex['interval'] = ('interval '
                      s.sub(s.iloc[0]).dt.days
                     .floordiv(7).add(1).astype(str)
                    )

NB. if the dates are not ordered, use s.min() as a reference (not s.iloc[0]). This method should work well with missing and duplicated dates.

output:

        dates    values    interval
0  2022-01-01  0.792507  interval 1
1  2022-01-02  0.485413  interval 1
2  2022-01-03  0.862531  interval 1
3  2022-01-04  0.216753  interval 1
4  2022-01-05  0.871699  interval 1
5  2022-01-06  0.723986  interval 1
6  2022-01-07  0.907056  interval 1
7  2022-01-08  0.618148  interval 2
8  2022-01-09  0.173755  interval 2
9  2022-01-10  0.218576  interval 2
10 2022-01-11  0.510574  interval 2
11 2022-01-12  0.832323  interval 2
12 2022-01-13  0.321287  interval 2
13 2022-01-14  0.721609  interval 2
14 2022-01-15  0.472324  interval 3
15 2022-01-16  0.492423  interval 3
16 2022-01-17  0.061945  interval 3
17 2022-01-18  0.161257  interval 3
18 2022-01-19  0.148596  interval 3
19 2022-01-20  0.047933  interval 3
20 2022-01-21  0.078941  interval 3

CodePudding user response：

use a function

jump = 7
df['dates'] = pd.to_datetime(df['dates'])
func = lambda x: int((x - df['dates'][0]).days) // jump   1
df['week'] = df['dates'].apply(func)

The logic is that from the initial date I am taking the difference of each day and dividing it by the week size which is 7. This will give me the number of weeks for each row. I am adding 1 because it should start from 1 according to your scenario.