I have a DataFrame containing dates and ids, grouped by week:
df:
date        id
2022-02-07  1
            3
            5
            4
2022-02-14  2
            1
            3
2022-02-21  9
            10
            1
...
...
2022-05-16  ....
I want to count, for each week, how many of the ids repeat from the previous week. For example, the desired output for the DataFrame above would be:
date        count
2022-02-07  0
2022-02-14  2    # because ids 1 and 3 are present in the previous week
2022-02-21  1    # because id 1 is present in the previous week
...
I tried grouping by id and counting, for each id, how many times it repeats per date, but it didn't work out as planned.
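For reference, a minimal sketch that reproduces the sample data above (the exact values are an assumption based on the table shown):

import pandas as pd

# hypothetical reconstruction of the sample frame shown above
df = pd.DataFrame({
    'date': ['2022-02-07'] * 4 + ['2022-02-14'] * 3 + ['2022-02-21'] * 3,
    'id':   [1, 3, 5, 4, 2, 1, 3, 9, 10, 1],
})
df['date'] = pd.to_datetime(df['date'])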
CodePudding user response:
Solution
s = pd.crosstab(df.date, df.id)
(s.eq(s.shift()) & s.ne(0)).sum(1)
Explained
Create a frequency table with crosstab
>>> pd.crosstab(df.date, df.id)
id          1  2  3  4  5  9  10
date
2022-02-07  1  0  1  1  1  0   0
2022-02-14  1  1  1  0  0  0   0
2022-02-21  1  0  0  0  0  1   1
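As an aside, the same frequency table could also be sketched with groupby/unstack if you prefer that idiom (not part of the original answer):

# equivalent frequency table built without crosstab
s = df.groupby(['date', 'id']).size().unstack(fill_value=0)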
Compare each row with the previous row to find which ids repeat. While comparing, also require the current value to be non-zero, otherwise an id that is absent in both weeks (0 == 0) would count as a match:
>>> s.eq(s.shift()) & s.ne(0)
id              1      2      3      4      5      9     10
date
2022-02-07  False  False  False  False  False  False  False
2022-02-14   True  False   True  False  False  False  False
2022-02-21   True  False  False  False  False  False  False
Sum the True values along the column axis to get the count of repeating ids per week:
>>> (s.eq(s.shift()) & s.ne(0)).sum(1)
date
2022-02-07    0
2022-02-14    2
2022-02-21    1
dtype: int64
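Note that s.eq(s.shift()) only treats an id as repeating when its count is identical in both weeks. If an id can appear a different number of times from one week to the next and mere presence is what matters, a presence-based sketch (my assumption about the intent, not part of the original answer) would be:

# count ids that appear at least once in both the current and the previous week
repeats = (s.gt(0) & s.shift(fill_value=0).gt(0)).sum(1)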
CodePudding user response:
You can try grouping the original dataframe and aggregating the id column to a list, then counting the common elements by comparing each list with its shifted counterpart.
out = df.groupby('date')['id'].apply(list)
# pair each week's ids with the previous week's ids; the first week
# is compared against an empty list, so nothing matches there
lst = [[j for j in js if j in ks]
       for js, ks in zip(out, out.shift(fill_value=[[]]))]
df = pd.DataFrame({'date': out.index,
                   'count': list(map(len, lst))})
print(df)
date count
0 2022-02-07 0
1 2022-02-14 2
2 2022-02-21 1
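If the weekly lists get long, the j in ks membership test becomes quadratic; a sketch using sets instead, starting again from the original dataframe and assuming ids are unique within a week:

grouped = df.groupby('date')['id'].apply(set)
ids = list(grouped)
prev = [set()] + ids[:-1]    # previous week's ids; empty set for the first week
counts = [len(cur & p) for cur, p in zip(ids, prev)]
result = pd.DataFrame({'date': grouped.index, 'count': counts})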