I have a pandas dataframe with a date column and a id column. I would like to return the number of occurences the the id of each line, in the past 14 days prior to the corresponding date of each line. That means, I would like to return "1, 2, 1, 2, 3, 4, 1". How can I do this? Performance is important since the dataframe has a len of 200,000 rows or so. Thanks !
date | id |
---|---|
2021-01-01 | 1 |
2021-01-04 | 1 |
2021-01-05 | 2 |
2021-01-06 | 2 |
2021-01-07 | 1 |
2021-01-08 | 1 |
2021-01-28 | 1 |
CodePudding user response:
Assuming the input is sorted by date, you can use a GroupBy.rolling
approach:
# only required if date is not datetime type
df['date'] = pd.to_datetime(df['date'])
(df.assign(count=1)
.set_index('date')
.groupby('id')
.rolling('14d')['count'].sum()
.sort_index(level='date').reset_index() #optional if order is not important
)
output:
id date count
0 1 2021-01-01 1.0
1 1 2021-01-04 2.0
2 2 2021-01-05 1.0
3 2 2021-01-06 2.0
4 1 2021-01-07 3.0
5 1 2021-01-08 4.0
6 1 2021-01-28 1.0
CodePudding user response:
I am not sure whether this is the best idea or not, but the code below is what I have come up with:
from datetime import timedelta
df["date"] = pd.to_datetime(df["date"])
newColumn = []
for index, row in df.iterrows():
endDate = row["date"]
startDate = endDate - timedelta(days=14)
id = row["id"]
summation = df[(df["date"] >= startDate) & (df["date"] <= endDate) & (df["id"] == id)]["id"].count()
newColumn.append(summation)
df["check_column"] = newColumn
df
Output
date | id | check_column | |
---|---|---|---|
0 | 2021-01-01 00:00:00 | 1 | 1 |
1 | 2021-01-04 00:00:00 | 1 | 2 |
2 | 2021-01-05 00:00:00 | 2 | 1 |
3 | 2021-01-06 00:00:00 | 2 | 2 |
4 | 2021-01-07 00:00:00 | 1 | 3 |
5 | 2021-01-08 00:00:00 | 1 | 4 |
6 | 2021-01-28 00:00:00 | 1 | 1 |
Explanation
In this approach, I have used iterrows
in order to loop over the dataframe's rows. Additionally, I have used timedelta
in order to subtract 14 days from the date column.
CodePudding user response:
if the date column is datetime dtype (if not convert it) , here is another way with series.diff
and groupby.cumcount
s = df['date'].diff().dt.days.fillna(0).le(14)
df['expected_count'] = df.groupby([s.ne(s.shift()).cumsum(),'id'])['id'].cumcount() 1
print(df)
date id expected_count
0 2021-01-01 1 1
1 2021-01-04 1 2
2 2021-01-05 2 1
3 2021-01-06 2 2
4 2021-01-07 1 3
5 2021-01-08 1 4
6 2021-01-28 1 1