Home > Back-end >  pandas dataframe count duplicates by group and date
pandas dataframe count duplicates by group and date

Time:11-15

I have a dataframe with 3 columns:

  • 'Date' unique for each row
  • 'Group' group to which each row belongs
  • 'ID' identifier for each row

In the 'ID' columns there are some duplicate values. I want to count the number of repetitions for each group and for each date.
Example of dataframe:

                   Date Group    ID
0   2021-10-29 09:15:52     B  9352
1   2021-10-29 10:58:57     A  9352
2   2021-10-29 11:20:46     C  9352
3   2021-10-29 12:47:34     C  6274
4   2021-10-29 15:41:35     C  1677
5   2021-10-29 16:12:39     B  1677
6   2021-10-29 18:57:56     B  9225
7   2021-10-29 19:46:46     C  9225
8   2021-10-30 01:23:07     C  9225
9   2021-10-30 02:13:57     A  9225
10  2021-10-30 05:03:52     B  9329
11  2021-10-30 07:48:39     B  9329
12  2021-10-30 08:45:00     A  9329
13  2021-10-30 11:17:47     C  9329
14  2021-10-30 21:46:07     C  9496
15  2021-10-30 22:13:29     A  1218
16  2021-10-31 05:39:38     C  2422
17  2021-10-31 05:39:41     C  9654
18  2021-10-31 10:10:21     A  1951
19  2021-10-31 10:19:45     A  1951
20  2021-10-31 16:40:10     A  1951
21  2021-10-31 16:41:23     C  1951
22  2021-10-31 22:07:16     A  1951
23  2021-11-01 00:26:30     C  2867
24  2021-11-01 01:25:46     B  2867
25  2021-11-01 01:53:16     B  4262
26  2021-11-01 01:58:30     A  4581
27  2021-11-01 05:23:26     C  1734
28  2021-11-01 05:38:22     C  1734
29  2021-11-01 05:47:43     C  1734
30  2021-11-01 06:49:27     A  4813
31  2021-11-01 07:54:02     C  4813
32  2021-11-01 12:10:48     C  8661
33  2021-11-01 14:32:50     C  4138
34  2021-11-01 18:38:16     B  4138
35  2021-11-01 21:33:37     C  4138
36  2021-11-02 02:34:53     B  4138
37  2021-11-02 04:45:56     C  4138
38  2021-11-02 07:33:38     C  4138
39  2021-11-02 07:40:33     C  4138
40  2021-11-02 08:06:21     B  4138
41  2021-11-02 08:32:20     C  4138
42  2021-11-02 09:47:26     A  4138
43  2021-11-02 15:51:33     C  4138
44  2021-11-02 16:04:33     B  2433
45  2021-11-02 20:47:01     B  2433
46  2021-11-03 04:11:57     A  4594
47  2021-11-03 12:36:16     A  6829
48  2021-11-03 14:42:14     A  6829
49  2021-11-03 18:03:27     B  7138
50  2021-11-03 18:46:46     C  7138
51  2021-11-03 19:07:01     C  7138
52  2021-11-03 19:23:02     A  9752
53  2021-11-03 21:10:51     A  2699
54  2021-11-04 00:58:12     C  2699
55  2021-11-04 03:44:12     A  7463
56  2021-11-04 05:40:07     C  4558
57  2021-11-04 05:56:51     C  7855
58  2021-11-04 06:27:28     C  7855
59  2021-11-04 07:50:46     C  7855

Desired result:

          Date Group  Repetitions
0   2021-10-29     A            1
1   2021-10-29     B            3
2   2021-10-29     C            3
3   2021-10-30     A            2
4   2021-10-30     B            2
5   2021-10-30     C            2
6   2021-10-31     A            4
7   2021-10-31     B            0
8   2021-10-31     C            1
9   2021-11-01     A            1
10  2021-11-01     B            2
11  2021-11-01     C            7
12  2021-11-02     A            1
13  2021-11-02     B            4
14  2021-11-02     C            5
15  2021-11-03     A            3
16  2021-11-03     B            1
17  2021-11-03     C            2
18  2021-11-04     A            0
19  2021-11-04     B            0
20  2021-11-04     C            4

Note that the duplicate conditions spans over dates and groups: the 'ID' 2699 in the example above counts as duplicate even if those repetitions belongs to different dates and groups.

CodePudding user response:

IIUC, you can replace the duplicated values with 1, the others with zero, then groupby agg sum:

(df.assign(Date=pd.to_datetime(df['Date']).dt.normalize(),
           ID=df['ID'].duplicated(keep=False).astype(int)
          )
   .groupby(['Date', 'Group'], as_index=False).agg(repetitions=('ID', 'sum'))
)

output:

         Date Group  repetitions
0  2021-10-29     A            1
1  2021-10-29     B            3
2  2021-10-29     C            3
3  2021-10-30     A            2
4  2021-10-30     B            2
5  2021-10-30     C            2
6  2021-10-31     A            4
7  2021-10-31     C            1
8  2021-11-01     A            1
9  2021-11-01     B            2
10 2021-11-01     C            7
11 2021-11-02     A            1
12 2021-11-02     B            4
13 2021-11-02     C            5
14 2021-11-03     A            3
15 2021-11-03     B            1
16 2021-11-03     C            2
17 2021-11-04     A            0
18 2021-11-04     C            4

adding the missing combinations

dates = pd.to_datetime(df['Date']).dt.normalize()
idx = pd.MultiIndex.from_product([dates.unique(), df['Group'].unique()], names=['Date', 'Group'])

(df.assign(Date=dates,
           ID=df['ID'].duplicated(keep=False).astype(int)
          )
   .groupby(['Date', 'Group']).agg(repetitions=('ID', 'sum'))
   .reindex(idx, fill_value=0)
   .reset_index()
)

output:

         Date Group  repetitions
0  2021-10-29     B            3
1  2021-10-29     A            1
2  2021-10-29     C            3
3  2021-10-30     B            2
4  2021-10-30     A            2
5  2021-10-30     C            2
6  2021-10-31     B            0
7  2021-10-31     A            4
8  2021-10-31     C            1
9  2021-11-01     B            2
10 2021-11-01     A            1
11 2021-11-01     C            7
12 2021-11-02     B            4
13 2021-11-02     A            1
14 2021-11-02     C            5
15 2021-11-03     B            1
16 2021-11-03     A            3
17 2021-11-03     C            2
18 2021-11-04     B            0
19 2021-11-04     A            0
20 2021-11-04     C            4

CodePudding user response:

df['ID'].value_counts()

should give you the amount of times each id occurs.

  • Related