I created a DataFrame with pandas that looks like this:
HostName | Date |
---|---|
B | 2021-01-01 12:30 |
A | 2021-01-01 12:45 |
C | 2021-01-01 12:46 |
A | 2021-02-01 12:42 |
B | 2021-02-01 12:43 |
A | 2021-02-01 12:45 |
B | 2021-02-25 12:46 |
C | 2021-03-01 12:41 |
A | 2021-03-01 12:42 |
A | 2021-03-01 12:43 |
C | 2021-03-01 12:45 |
For every row, I want to count how many different `HostName` values appear from the beginning of that day (e.g. 2021-01-01 00:00) up to and including that row.
For example, take 2021-01-01:
HostName | Date |
---|---|
B | 2021-01-01 12:30 |
A | 2021-01-01 12:45 |
C | 2021-01-01 12:46 |
There are three rows:
- the first result is 1, because it is the first row of the day (B)
- the second result is 2, because from the beginning of the day up to this line there are two different HostNames (B, A)
- the third result is 3, because from the beginning of the day up to this line there are three different HostNames (B, A, C)
The end result should look like this:
HostName | Date | Result |
---|---|---|
B | 2021-01-01 12:30 | 1 |
A | 2021-01-01 12:45 | 2 |
C | 2021-01-01 12:46 | 3 |
A | 2021-02-01 12:42 | 1 |
B | 2021-02-01 12:43 | 2 |
A | 2021-02-01 12:45 | 2 |
B | 2021-02-25 12:46 | 1 |
C | 2021-03-01 12:41 | 1 |
A | 2021-03-01 12:42 | 2 |
A | 2021-03-01 12:43 | 2 |
C | 2021-03-01 12:45 | 2 |
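For reference, the sample frame above can be reconstructed like this (a minimal sketch; it assumes `Date` should be a datetime column, as the answers below require):

```python
import pandas as pd

# Rebuild the example data with Date already parsed as datetime64
df = pd.DataFrame({
    'HostName': ['B', 'A', 'C', 'A', 'B', 'A', 'B', 'C', 'A', 'A', 'C'],
    'Date': pd.to_datetime([
        '2021-01-01 12:30', '2021-01-01 12:45', '2021-01-01 12:46',
        '2021-02-01 12:42', '2021-02-01 12:43', '2021-02-01 12:45',
        '2021-02-25 12:46',
        '2021-03-01 12:41', '2021-03-01 12:42', '2021-03-01 12:43',
        '2021-03-01 12:45',
    ]),
})
```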
What I tried but failed:
df.groupby(['HostName','Date'])['HostName'].cumcount() + 1
or
import numpy as np

def f(x):
    one = x['HostName'].to_numpy()
    two = x['Date'].to_numpy()
    both = x[['HostName', 'Date']].shift(1).to_numpy()
    x['Host_1D_CumCount_Conn'] = [np.sum((one == a) & (two == b)) for a, b in both]
    return x

df.groupby('HostName').apply(f)
CodePudding user response:
Use a lambda function in GroupBy.transform with Series.duplicated and a cumulative sum:
df['Result'] = (df.groupby(df['Date'].dt.date)['HostName']
                  .transform(lambda x: (~x.duplicated()).cumsum()))
print(df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
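A minimal sketch of why this works, on a small hypothetical series: `duplicated()` marks every repeat occurrence of a value, so its negation is True exactly the first time each value is seen, and the cumulative sum of those booleans is a running count of distinct values.

```python
import pandas as pd

s = pd.Series(['B', 'A', 'A', 'C'])
first_seen = ~s.duplicated()   # True only on the first occurrence of each value
running = first_seen.cumsum()  # running number of distinct values seen so far
print(running.tolist())        # [1, 2, 2, 3]
```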
An alternative, faster solution: create a helper column d for the dates and a boolean column marking rows that are not duplicated per d and HostName, then use GroupBy.cumsum:
df['Result'] = (df.assign(d=df['Date'].dt.date,
                          new=lambda x: ~x.duplicated(['d', 'HostName']))
                  .groupby('d')['new']
                  .cumsum())
print(df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
CodePudding user response:
You can groupby the date and use an expanding nunique. The issue is that, currently, expanding only works with numerical values (I wish we could simply do expanding().nunique()). Thus we have to cheat a bit and factorize the column to numbers before applying pd.Series.nunique.
df['Result'] = (df.groupby(pd.to_datetime(df['Date']).dt.date, group_keys=False)
                  ['HostName']
                  .apply(lambda s: pd.Series(s.factorize()[0])
                                     .expanding().apply(pd.Series.nunique))
                  .astype(int)
                  .values
                )
output:
HostName Date Result
0 B 2021-01-01 12:30 1
1 A 2021-01-01 12:45 2
2 C 2021-01-01 12:46 3
3 A 2021-02-01 12:42 1
4 B 2021-02-01 12:43 2
5 A 2021-02-01 12:45 2
6 B 2021-02-25 12:46 1
7 C 2021-03-01 12:41 1
8 A 2021-03-01 12:42 2
9 A 2021-03-01 12:43 2
10 C 2021-03-01 12:45 2
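The factorize-then-expanding step can be seen in isolation on a small hypothetical series: factorize maps each distinct label to an integer code, and the expanding apply then counts the distinct codes in each growing prefix.

```python
import pandas as pd

s = pd.Series(['B', 'A', 'A', 'C'])
codes = pd.Series(s.factorize()[0])  # labels mapped to integer codes: [0, 1, 1, 2]
# distinct count over each expanding prefix of the codes
result = codes.expanding().apply(pd.Series.nunique).astype(int)
print(result.tolist())  # [1, 2, 2, 3]
```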