Home > database >  Python - Count duplicate user Id's occurence in a given month
Python - Count duplicate user Id's occurence in a given month

Time:10-19

If I create a Dataframe from

df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26','2022-08-30','2022-09-3','2022-09-8','2022-09-13'], 
              "id": ['A','B','C','D','E','B','A','F','G','F','H']})

df['date'] = pd.to_datetime(df['date'])

(Table 1 below showing the data)

I am interested in counting how many times an ID appears in a given month. For example in a given month A, B and F all occur twice whilst everything else occurs once. The difficulty with this data is that the the frequency of dates are not evenly spread out.

I attempted to resample on date by month, with the hope of counting duplicates.

df.resample('M', on='date')['id']

But all the functions that can be used on resample just give me the number of unique occurences rather than how many times each ID occured.

A rough example of the output is below [Table 2]

All of the examples I have seen merely count how many total or unique occurences occur for a given month, this question is focused on finding out how many occurences each Id had in a month.

Thankyou for your time.

[Table 1] - Data

idx date id
0 2022-08-10 A
1 2022-08-18 B
2 2022-08-18 C
3 2022-08-20 D
4 2022-08-20 E
5 2022-08-24 B
6 2022-08-26 A
7 2022-08-30 F
8 2022-09-03 G
9 2022-09-08 F
10 2022-09-13 H

[Table 2] - Rough example of desired output

id occurences in a month
A 2
B 2
C 1
D 1
E 1
F 2
G 1
H 1

CodePudding user response:

Use Series.dt.to_period for month periods and count values per id by GroupBy.size, then aggregate sum:

df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
        .size()
        .groupby(level=0)
        .sum()
        .reset_index(name='occurences in a month'))
print (df1)
  id  occurences in a month
0  A                      2
1  B                      2
2  C                      1
3  D                      1
4  E                      1
5  F                      2
6  G                      1
7  H                      1

Or use Grouper:

df1 = (df.groupby(['id',pd.Grouper(freq='M', key='date')])
        .size()
        .groupby(level=0)
        .sum()
        .reset_index(name='occurences in a month'))
print (df1)

EDIT:

df = pd.DataFrame({"date": ['2022-08-10','2022-08-18','2022-08-18','2022-08-20','2022-08-20','2022-08-24','2022-08-26',
                            '2022-08-30','2022-09-3','2022-09-8','2022-09-13','2050-12-15'],
                   "id": ['A','B','C','D','E','B','A','F','G','F','H','H']}) 

df['date'] = pd.to_datetime(df['date'],format='%Y-%m-%d')

print (df)
        

Because count first per month or days or dates and sum values it is same like:

df1 = df.groupby('id').size().reset_index(name='occurences')
        
print (df1)
  id  occurences
0  A           2
1  B           2
2  C           1
3  D           1
4  E           1
5  F           2
6  G           1
7  H           2

Same sum of counts per id:

df1 = (df.groupby(['id', df['date'].dt.to_period('m')])
        .size())
print (df1)
id  date   
A   2022-08    2
B   2022-08    2
C   2022-08    1
D   2022-08    1
E   2022-08    1
F   2022-08    1
    2022-09    1
G   2022-09    1
H   2022-09    1
    2050-12    1
dtype: int64

df1 = (df.groupby(['id', df['date'].dt.to_period('d')])
        .size())
print (df1)
id  date      
A   2022-08-10    1
    2022-08-26    1
B   2022-08-18    1
    2022-08-24    1
C   2022-08-18    1
D   2022-08-20    1
E   2022-08-20    1
F   2022-08-30    1
    2022-09-08    1
G   2022-09-03    1
H   2022-09-13    1
    2050-12-15    1
dtype: int64

df1 = (df.groupby(['id', df['date'].dt.day])
        .size())
print (df1)
id  date
A   10      1
    26      1
B   18      1
    24      1
C   18      1
D   20      1
E   20      1
F   8       1
    30      1
G   3       1
H   13      1
    15      1
dtype: int64
  • Related