I have a dataframe with 3 columns:
'Date'
unique for each row'Group'
group to which each row belongs'ID'
identifier for each row
In the 'ID'
columns there are some duplicate values. I want to count the number of repetitions for each group and for each date.
Example of dataframe:
Date Group ID
0 2021-10-29 09:15:52 B 9352
1 2021-10-29 10:58:57 A 9352
2 2021-10-29 11:20:46 C 9352
3 2021-10-29 12:47:34 C 6274
4 2021-10-29 15:41:35 C 1677
5 2021-10-29 16:12:39 B 1677
6 2021-10-29 18:57:56 B 9225
7 2021-10-29 19:46:46 C 9225
8 2021-10-30 01:23:07 C 9225
9 2021-10-30 02:13:57 A 9225
10 2021-10-30 05:03:52 B 9329
11 2021-10-30 07:48:39 B 9329
12 2021-10-30 08:45:00 A 9329
13 2021-10-30 11:17:47 C 9329
14 2021-10-30 21:46:07 C 9496
15 2021-10-30 22:13:29 A 1218
16 2021-10-31 05:39:38 C 2422
17 2021-10-31 05:39:41 C 9654
18 2021-10-31 10:10:21 A 1951
19 2021-10-31 10:19:45 A 1951
20 2021-10-31 16:40:10 A 1951
21 2021-10-31 16:41:23 C 1951
22 2021-10-31 22:07:16 A 1951
23 2021-11-01 00:26:30 C 2867
24 2021-11-01 01:25:46 B 2867
25 2021-11-01 01:53:16 B 4262
26 2021-11-01 01:58:30 A 4581
27 2021-11-01 05:23:26 C 1734
28 2021-11-01 05:38:22 C 1734
29 2021-11-01 05:47:43 C 1734
30 2021-11-01 06:49:27 A 4813
31 2021-11-01 07:54:02 C 4813
32 2021-11-01 12:10:48 C 8661
33 2021-11-01 14:32:50 C 4138
34 2021-11-01 18:38:16 B 4138
35 2021-11-01 21:33:37 C 4138
36 2021-11-02 02:34:53 B 4138
37 2021-11-02 04:45:56 C 4138
38 2021-11-02 07:33:38 C 4138
39 2021-11-02 07:40:33 C 4138
40 2021-11-02 08:06:21 B 4138
41 2021-11-02 08:32:20 C 4138
42 2021-11-02 09:47:26 A 4138
43 2021-11-02 15:51:33 C 4138
44 2021-11-02 16:04:33 B 2433
45 2021-11-02 20:47:01 B 2433
46 2021-11-03 04:11:57 A 4594
47 2021-11-03 12:36:16 A 6829
48 2021-11-03 14:42:14 A 6829
49 2021-11-03 18:03:27 B 7138
50 2021-11-03 18:46:46 C 7138
51 2021-11-03 19:07:01 C 7138
52 2021-11-03 19:23:02 A 9752
53 2021-11-03 21:10:51 A 2699
54 2021-11-04 00:58:12 C 2699
55 2021-11-04 03:44:12 A 7463
56 2021-11-04 05:40:07 C 4558
57 2021-11-04 05:56:51 C 7855
58 2021-11-04 06:27:28 C 7855
59 2021-11-04 07:50:46 C 7855
Desired result:
Date Group Repetitions
0 2021-10-29 A 1
1 2021-10-29 B 3
2 2021-10-29 C 3
3 2021-10-30 A 2
4 2021-10-30 B 2
5 2021-10-30 C 2
6 2021-10-31 A 4
7 2021-10-31 B 0
8 2021-10-31 C 1
9 2021-11-01 A 1
10 2021-11-01 B 2
11 2021-11-01 C 7
12 2021-11-02 A 1
13 2021-11-02 B 4
14 2021-11-02 C 5
15 2021-11-03 A 3
16 2021-11-03 B 1
17 2021-11-03 C 2
18 2021-11-04 A 0
19 2021-11-04 B 0
20 2021-11-04 C 4
Note that the duplicate conditions spans over dates and groups: the 'ID'
2699
in the example above counts as duplicate even if those repetitions belongs to different dates and groups.
CodePudding user response:
IIUC, you can replace the duplicated values with 1, the others with zero, then groupby
agg
sum
:
(df.assign(Date=pd.to_datetime(df['Date']).dt.normalize(),
ID=df['ID'].duplicated(keep=False).astype(int)
)
.groupby(['Date', 'Group'], as_index=False).agg(repetitions=('ID', 'sum'))
)
output:
Date Group repetitions
0 2021-10-29 A 1
1 2021-10-29 B 3
2 2021-10-29 C 3
3 2021-10-30 A 2
4 2021-10-30 B 2
5 2021-10-30 C 2
6 2021-10-31 A 4
7 2021-10-31 C 1
8 2021-11-01 A 1
9 2021-11-01 B 2
10 2021-11-01 C 7
11 2021-11-02 A 1
12 2021-11-02 B 4
13 2021-11-02 C 5
14 2021-11-03 A 3
15 2021-11-03 B 1
16 2021-11-03 C 2
17 2021-11-04 A 0
18 2021-11-04 C 4
adding the missing combinations
dates = pd.to_datetime(df['Date']).dt.normalize()
idx = pd.MultiIndex.from_product([dates.unique(), df['Group'].unique()], names=['Date', 'Group'])
(df.assign(Date=dates,
ID=df['ID'].duplicated(keep=False).astype(int)
)
.groupby(['Date', 'Group']).agg(repetitions=('ID', 'sum'))
.reindex(idx, fill_value=0)
.reset_index()
)
output:
Date Group repetitions
0 2021-10-29 B 3
1 2021-10-29 A 1
2 2021-10-29 C 3
3 2021-10-30 B 2
4 2021-10-30 A 2
5 2021-10-30 C 2
6 2021-10-31 B 0
7 2021-10-31 A 4
8 2021-10-31 C 1
9 2021-11-01 B 2
10 2021-11-01 A 1
11 2021-11-01 C 7
12 2021-11-02 B 4
13 2021-11-02 A 1
14 2021-11-02 C 5
15 2021-11-03 B 1
16 2021-11-03 A 3
17 2021-11-03 C 2
18 2021-11-04 B 0
19 2021-11-04 A 0
20 2021-11-04 C 4
CodePudding user response:
df['ID'].value_counts()
should give you the amount of times each id occurs.