Python - Count duplicate user Id's occurence in a given month-CodePudding

If I create a Dataframe from

df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26','2022-08-30','2022-09-3','2022-09-8','2022-09-13'], 
              "id": ['A','B','C','D','E','B','A','F','G','F','H']})

df['date'] = pd.to_datetime(df['date'])

(Table 1 below showing the data)

I am interested in counting how many times an ID appears in a given month. For example in a given month A, B and F all occur twice whilst everything else occurs once. The difficulty with this data is that the the frequency of dates are not evenly spread out.

I attempted to resample on date by month, with the hope of counting duplicates.

df.resample('M', on='date')['id']

But all the functions that can be used on resample just give me the number of unique occurences rather than how many times each ID occured.

A rough example of the output is below [Table 2]

All of the examples I have seen merely count how many total or unique occurences occur for a given month, this question is focused on finding out how many occurences each Id had in a month.

Thankyou for your time.

[Table 1] - Data

idx	date	id
0	2022-08-10	A
1	2022-08-18	B
2	2022-08-18	C
3	2022-08-20	D
4	2022-08-20	E
5	2022-08-24	B
6	2022-08-26	A
7	2022-08-30	F
8	2022-09-03	G
9	2022-09-08	F
10	2022-09-13	H

[Table 2] - Rough example of desired output

id	occurences in a month
A	2
B	2
C	1
D	1
E	1
F	2
G	1
H	1

CodePudding user response：

Use Series.dt.to_period for month periods and count values per id by GroupBy.size, then aggregate sum:

df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
        .size()
        .groupby(level=0)
        .sum()
        .reset_index(name='occurences in a month'))
print (df1)
  id  occurences in a month
0  A                      2
1  B                      2
2  C                      1
3  D                      1
4  E                      1
5  F                      2
6  G                      1
7  H                      1

Or use Grouper:

df1 = (df.groupby(['id',pd.Grouper(freq='M', key='date')])
        .size()
        .groupby(level=0)
        .sum()
        .reset_index(name='occurences in a month'))
print (df1)

EDIT:

df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26',
                            '2022-08-30','2022-09-3','2022-09-8','2022-09-13','2050-12-15'],
                   "id": ['A','B','C','D','E','B','A','F','G','F','H','H']}) 

df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')

print (df)

Because count first per month or days or dates and sum values it is same like:

df1 = df.groupby('id').size().reset_index(name='occurences')
        
print (df1)
  id  occurences
0  A           2
1  B           2
2  C           1
3  D           1
4  E           1
5  F           2
6  G           1
7  H           2

Same sum of counts per id:

df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
        .size())
print (df1)
id  date   
A   2022-08    2
B   2022-08    2
C   2022-08    1
D   2022-08    1
E   2022-08    1
F   2022-08    1
    2022-09    1
G   2022-09    1
H   2022-09    1
    2050-12    1
dtype: int64

df1 = (df.groupby(['id', df['date'].dt.to_period('d')])
        .size())
print (df1)
id  date      
A   2022-08-10    1
    2022-08-26    1
B   2022-08-18    1
    2022-08-24    1
C   2022-08-18    1
D   2022-08-20    1
E   2022-08-20    1
F   2022-08-30    1
    2022-09-08    1
G   2022-09-03    1
H   2022-09-13    1
    2050-12-15    1
dtype: int64

df1 = (df.groupby(['id', df['date'].dt.day])
        .size())
print (df1)
id  date
A   10      1
    26      1
B   18      1
    24      1
C   18      1
D   20      1
E   20      1
F   8       1
    30      1
G   3       1
H   13      1
    15      1
dtype: int64