I have the pandas dataframe below:
groupId | date | value |
---|---|---|
1 | 2023-01-01 | A |
1 | 2023-01-05 | B |
1 | 2023-01-17 | C |
2 | 2023-01-01 | A |
2 | 2023-01-20 | B |
3 | 2023-01-01 | A |
3 | 2023-01-10 | B |
3 | 2023-01-12 | C |
I would like to do a groupby and count the number of unique values for each groupId, but only looking at the last n=14 days relative to the date of each row.
What I would like as a result is something like this:
groupId | date | value | newColumn |
---|---|---|---|
1 | 2023-01-01 | A | 1 |
1 | 2023-01-05 | B | 2 |
1 | 2023-01-17 | C | 2 |
2 | 2023-01-01 | A | 1 |
2 | 2023-01-20 | B | 1 |
3 | 2023-01-01 | A | 1 |
3 | 2023-01-10 | B | 2 |
3 | 2023-01-12 | C | 3 |
I did try using groupby(...).rolling('14d').nunique(), and while the rolling function works on numeric fields (count, mean, etc.), it doesn't work with nunique on string fields to count the number of unique string/object values.
You can use the code below to generate the dataframe.
pd.DataFrame(
{
'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
'date': ['2023-01-01', '2023-01-05', '2023-01-17', '2023-01-01', '2023-01-20', '2023-01-01', '2023-01-10', '2023-01-12'], #YYYY-MM-DD
'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
'newColumn': [1, 2, 2, 1, 1, 1, 2, 3]
}
)
Do you have an idea of how to solve this, even without using the rolling function? That'd be much appreciated!
CodePudding user response:
Instead of nunique, you can also use count:
>>> (df.groupby('groupId').rolling('14D', on='date')['value'].count()
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
Caveat: merging this output back into your original dataframe can be tricky unless (groupId, date) is a unique combination. Also note that count only matches nunique here because no value repeats within any group's 14-day window in the sample data.
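For instance, with the sample data, where each (groupId, date) pair is unique, the rolling result can be joined back with merge. A sketch, using pd.factorize so that rolling(...).apply can compute nunique on the string column:

```python
import pandas as pd

df = pd.DataFrame({
    'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
    'date': pd.to_datetime(['2023-01-01', '2023-01-05', '2023-01-17',
                            '2023-01-01', '2023-01-20',
                            '2023-01-01', '2023-01-10', '2023-01-12']),
    'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
})

# Per-group rolling 14-day unique count (factorize makes 'value' numeric
# so that rolling(...).apply can handle it).
out = (df.assign(value=pd.factorize(df['value'])[0])
         .groupby('groupId').rolling('14D', on='date')['value']
         .apply(lambda x: x.nunique())
         .astype(int).rename('newColumn').reset_index())

# Safe only because (groupId, date) uniquely identifies each row.
df = df.merge(out, on=['groupId', 'date'], how='left')
print(df)
```

If (groupId, date) could repeat, you would need a join key based on the original row position instead, as in the index trick below.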
Update
If your index is numeric (or you create a monotonically increasing dummy column), you can use this trick:
sr = (df.reset_index().groupby('groupId').rolling('14D', on='date')
.agg({'value': 'count', 'index': 'max'}).astype(int)
.set_index('index')['value'])
df['newColumn'] = sr
print(df)
# Output
groupId date value newColumn
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
Update 2
You can use pd.factorize to convert the value column to a numeric one:
>>> (df.assign(value=pd.factorize(df['value'])[0])
.groupby('groupId').rolling('14D', on='date')['value']
.apply(lambda x: x.nunique())
.astype(int).rename('newColumn').reset_index())
groupId date newColumn
0 1 2023-01-01 1
1 1 2023-01-05 2
2 1 2023-01-17 2
3 2 2023-01-01 1
4 2 2023-01-20 1
5 3 2023-01-01 1
6 3 2023-01-10 2
7 3 2023-01-12 3
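The difference from the plain count approach shows up once a value repeats inside a window. A small sketch with made-up data (not from the question), where group 3 sees 'A' twice within 14 days:

```python
import pandas as pd

# Hypothetical data: 'A' appears twice within the same 14-day window.
df = pd.DataFrame({
    'groupId': [3, 3, 3],
    'date': pd.to_datetime(['2023-01-01', '2023-01-10', '2023-01-12']),
    'value': ['A', 'B', 'A'],
})

out = (df.assign(value=pd.factorize(df['value'])[0])
         .groupby('groupId').rolling('14D', on='date')['value']
         .apply(lambda x: x.nunique())
         .astype(int))

# nunique-based counts: 1, 2, 2 -- a plain count() would give 1, 2, 3.
print(out.tolist())
```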
CodePudding user response:
Another possible solution, which does not use rolling:
df['date'] = pd.to_datetime(df['date'])
df['new2'] = df.groupby('groupId')['date'].transform(
    lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum() + 1)
Output:
groupId date value new2
0 1 2023-01-01 A 1
1 1 2023-01-05 B 2
2 1 2023-01-17 C 2
3 2 2023-01-01 A 1
4 2 2023-01-20 B 1
5 3 2023-01-01 A 1
6 3 2023-01-10 B 2
7 3 2023-01-12 C 3
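Put together as a self-contained script (using the sample frame from the question), this reproduces the expected newColumn:

```python
import pandas as pd

df = pd.DataFrame({
    'groupId': [1, 1, 1, 2, 2, 3, 3, 3],
    'date': ['2023-01-01', '2023-01-05', '2023-01-17',
             '2023-01-01', '2023-01-20',
             '2023-01-01', '2023-01-10', '2023-01-12'],
    'value': ['A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
    'newColumn': [1, 2, 2, 1, 1, 1, 2, 3],  # expected result
})

df['date'] = pd.to_datetime(df['date'])
# Flag rows whose cumulative elapsed days stay within 14 of the group's
# first date, then running-count the flags (plus 1 for the first row).
df['new2'] = df.groupby('groupId')['date'].transform(
    lambda x: x.diff().dt.days.cumsum().le(14).mul(~x.duplicated()).cumsum() + 1)

print(df[['groupId', 'date', 'value', 'new2']])
```

One caveat: diff().cumsum() measures elapsed days from each group's first date rather than a true per-row 14-day window, so while it matches the sample data, it may diverge on other inputs.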