Count number of occurences in past 14 days of certain value-CodePudding

I have a pandas dataframe with a date column and a id column. I would like to return the number of occurences the the id of each line, in the past 14 days prior to the corresponding date of each line. That means, I would like to return "1, 2, 1, 2, 3, 4, 1". How can I do this? Performance is important since the dataframe has a len of 200,000 rows or so. Thanks !

date	id
2021-01-01	1
2021-01-04	1
2021-01-05	2
2021-01-06	2
2021-01-07	1
2021-01-08	1
2021-01-28	1

CodePudding user response：

Assuming the input is sorted by date, you can use a GroupBy.rolling approach:

# only required if date is not datetime type
df['date'] = pd.to_datetime(df['date'])

(df.assign(count=1)
   .set_index('date')
   .groupby('id')
   .rolling('14d')['count'].sum()
   .sort_index(level='date').reset_index() #optional if order is not important
)

output:

   id       date  count
0   1 2021-01-01    1.0
1   1 2021-01-04    2.0
2   2 2021-01-05    1.0
3   2 2021-01-06    2.0
4   1 2021-01-07    3.0
5   1 2021-01-08    4.0
6   1 2021-01-28    1.0

CodePudding user response：

I am not sure whether this is the best idea or not, but the code below is what I have come up with:

from datetime import timedelta
df["date"] = pd.to_datetime(df["date"])
newColumn = []
for index, row in df.iterrows():
  endDate = row["date"]
  startDate = endDate - timedelta(days=14)
  id = row["id"]
  summation = df[(df["date"] >= startDate) & (df["date"] <= endDate) & (df["id"] == id)]["id"].count()
  newColumn.append(summation)
df["check_column"] = newColumn
df

Output

	date	id	check_column
0	2021-01-01 00:00:00	1	1
1	2021-01-04 00:00:00	1	2
2	2021-01-05 00:00:00	2	1
3	2021-01-06 00:00:00	2	2
4	2021-01-07 00:00:00	1	3
5	2021-01-08 00:00:00	1	4
6	2021-01-28 00:00:00	1	1

Explanation

In this approach, I have used iterrows in order to loop over the dataframe's rows. Additionally, I have used timedelta in order to subtract 14 days from the date column.

CodePudding user response：

if the date column is datetime dtype (if not convert it) , here is another way with series.diff and groupby.cumcount

s = df['date'].diff().dt.days.fillna(0).le(14)
df['expected_count'] = df.groupby([s.ne(s.shift()).cumsum(),'id'])['id'].cumcount() 1

print(df)

        date  id     expected_count
0 2021-01-01   1                  1
1 2021-01-04   1                  2
2 2021-01-05   2                  1
3 2021-01-06   2                  2
4 2021-01-07   1                  3
5 2021-01-08   1                  4
6 2021-01-28   1                  1