I have been looking online for a solution for this but I couldn't find one. Maybe it's too straightforward and I am not recognizing the solution but here is my question. I have a data frame that has rows representing transactions that took place on different days. The starting date of the period is 2022-01-01. I want to make a new column that would contain labels for which interval the transactions belong to. I want to create intervals of 7 days. So if the date of the transaction falls into the period of the first 7 days since the start date then those transactions would have the label "interval 1" if they fall in the next 7-day period then they would have the label "interval 2". Here is a dataframe as an example
# simulated data for stack overflow
data = {'dates':pd.date_range(start="2022-01-01",end="2022-01-21"),
'values':rand(21)}
df_ex = pd.DataFrame(data)
I would like something like this
please note though that the actual date column does not have unique days. Some days have multiple transactions. I have tried some other stack overflow answers but they were not answering my specific question. I tried using some groupby functions and .diff() to no avail
CodePudding user response:
You can convert to_datetime
, subtract the first date, get the number of days
, and use the floordiv
1 to build your string:
s = pd.to_datetime(df_ex['dates'])
df_ex['interval'] = ('interval '
s.sub(s.iloc[0]).dt.days
.floordiv(7).add(1).astype(str)
)
NB. if the dates are not ordered, use s.min()
as a reference (not s.iloc[0]
). This method should work well with missing and duplicated dates.
output:
dates values interval
0 2022-01-01 0.792507 interval 1
1 2022-01-02 0.485413 interval 1
2 2022-01-03 0.862531 interval 1
3 2022-01-04 0.216753 interval 1
4 2022-01-05 0.871699 interval 1
5 2022-01-06 0.723986 interval 1
6 2022-01-07 0.907056 interval 1
7 2022-01-08 0.618148 interval 2
8 2022-01-09 0.173755 interval 2
9 2022-01-10 0.218576 interval 2
10 2022-01-11 0.510574 interval 2
11 2022-01-12 0.832323 interval 2
12 2022-01-13 0.321287 interval 2
13 2022-01-14 0.721609 interval 2
14 2022-01-15 0.472324 interval 3
15 2022-01-16 0.492423 interval 3
16 2022-01-17 0.061945 interval 3
17 2022-01-18 0.161257 interval 3
18 2022-01-19 0.148596 interval 3
19 2022-01-20 0.047933 interval 3
20 2022-01-21 0.078941 interval 3
CodePudding user response:
use a function
jump = 7
df['dates'] = pd.to_datetime(df['dates'])
func = lambda x: int((x - df['dates'][0]).days) // jump 1
df['week'] = df['dates'].apply(func)
The logic is that from the initial date I am taking the difference of each day and dividing it by the week size which is 7. This will give me the number of weeks for each row. I am adding 1 because it should start from 1 according to your scenario.