I created a DataFrame with pandas that looks like this:
HostName | Date |
---|---|
B | 2021-01-01 12:30 |
A | 2021-01-01 12:45 |
C | 2021-01-01 12:46 |
A | 2021-02-01 12:42 |
B | 2021-02-01 12:43 |
A | 2021-02-01 12:45 |
B | 2021-02-25 12:46 |
C | 2021-03-01 12:41 |
A | 2021-03-01 12:42 |
A | 2021-03-01 12:43 |
C | 2021-03-01 12:45 |
For every row, I want to count how many different `HostName` values appear from the beginning of that day (e.g. 2021-01-01 00:00) up to and including that row.
For example, take 2021-01-01:
HostName | Date |
---|---|
B | 2021-01-01 12:30 |
A | 2021-01-01 12:45 |
C | 2021-01-01 12:46 |
There are three rows:
- the first result is 1, because it is the first row of the day (B)
- the second result is 2, because from the beginning of the day up to this line there are two different HostNames (B, A)
- the third result is 3, because from the beginning of the day up to this line there are three different HostNames (B, A, C)
The end result should look like this:
HostName | Date | Result |
---|---|---|
B | 2021-01-01 12:30 | 1 |
A | 2021-01-01 12:45 | 2 |
C | 2021-01-01 12:46 | 3 |
A | 2021-02-01 12:42 | 1 |
B | 2021-02-01 12:43 | 2 |
A | 2021-02-01 12:45 | 2 |
B | 2021-02-25 12:46 | 1 |
C | 2021-03-01 12:41 | 1 |
A | 2021-03-01 12:42 | 2 |
A | 2021-03-01 12:43 | 2 |
C | 2021-03-01 12:45 | 2 |
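For reference, the sample frame above can be reconstructed like this (a minimal sketch; it assumes `Date` should be a datetime column, as the answers below require):

```python
import pandas as pd

# Rebuild the example data with Date already parsed as datetime64
df = pd.DataFrame({
    'HostName': ['B', 'A', 'C', 'A', 'B', 'A', 'B', 'C', 'A', 'A', 'C'],
    'Date': pd.to_datetime([
        '2021-01-01 12:30', '2021-01-01 12:45', '2021-01-01 12:46',
        '2021-02-01 12:42', '2021-02-01 12:43', '2021-02-01 12:45',
        '2021-02-25 12:46',
        '2021-03-01 12:41', '2021-03-01 12:42', '2021-03-01 12:43',
        '2021-03-01 12:45',
    ]),
})
```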
What I tried but failed:
df.groupby(['HostName','Date'])['HostName'].cumcount() + 1
or
import numpy as np

def f(x):
    one = x['HostName'].to_numpy()
    two = x['Date'].to_numpy()
    both = x[['HostName', 'Date']].shift(1).to_numpy()
    x['Host_1D_CumCount_Conn'] = [np.sum((one == a) & (two == b)) for a, b in both]
    return x

df.groupby('HostName').apply(f)
CodePudding user response:
Use a lambda function in GroupBy.transform with Series.duplicated and a cumulative sum:
df['Result'] = (df.groupby(df['Date'].dt.date)['HostName']
                  .transform(lambda x: (~x.duplicated()).cumsum()))
print(df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
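A minimal sketch of why this works, on a small hypothetical series: `duplicated()` marks every repeat occurrence of a value, so its negation is True exactly the first time each value is seen, and the cumulative sum of those booleans is a running count of distinct values.

```python
import pandas as pd

s = pd.Series(['B', 'A', 'A', 'C'])
first_seen = ~s.duplicated()   # True only on the first occurrence of each value
running = first_seen.cumsum()  # running number of distinct values seen so far
print(running.tolist())        # [1, 2, 2, 3]
```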
An alternative, faster solution: create a helper column d for the dates and a boolean column marking rows that are not duplicated per d and HostName, then use GroupBy.cumsum:
df['Result'] = (df.assign(d=df['Date'].dt.date,
                          new=lambda x: ~x.duplicated(['d', 'HostName']))
                  .groupby('d')['new']
                  .cumsum())
print(df)
HostName Date Result
0 B 2021-01-01 12:30:00 1
1 A 2021-01-01 12:45:00 2
2 C 2021-01-01 12:46:00 3
3 A 2021-02-01 12:42:00 1
4 B 2021-02-01 12:43:00 2
5 A 2021-02-01 12:45:00 2
6 B 2021-02-25 12:46:00 1
7 C 2021-03-01 12:41:00 1
8 A 2021-03-01 12:42:00 2
9 A 2021-03-01 12:43:00 2
10 C 2021-03-01 12:45:00 2
CodePudding user response:
You can groupby the date and use an expanding nunique. The issue is that, currently, expanding only works with numerical values (I wish we could simply do expanding().nunique()). Thus we have to cheat a bit and factorize the column to numbers before applying pd.Series.nunique.
df['Result'] = (df.groupby(pd.to_datetime(df['Date']).dt.date, group_keys=False)
                  ['HostName']
                  .apply(lambda s: pd.Series(s.factorize()[0])
                                     .expanding().apply(pd.Series.nunique))
                  .astype(int)
                  .values
                )
output:
HostName Date Result
0 B 2021-01-01 12:30 1
1 A 2021-01-01 12:45 2
2 C 2021-01-01 12:46 3
3 A 2021-02-01 12:42 1
4 B 2021-02-01 12:43 2
5 A 2021-02-01 12:45 2
6 B 2021-02-25 12:46 1
7 C 2021-03-01 12:41 1
8 A 2021-03-01 12:42 2
9 A 2021-03-01 12:43 2
10 C 2021-03-01 12:45 2
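The factorize-then-expanding step can be seen in isolation on a small hypothetical series: factorize maps each distinct label to an integer code, and the expanding apply then counts the distinct codes in each growing prefix.

```python
import pandas as pd

s = pd.Series(['B', 'A', 'A', 'C'])
codes = pd.Series(s.factorize()[0])  # labels mapped to integer codes: [0, 1, 1, 2]
# distinct count over each expanding prefix of the codes
result = codes.expanding().apply(pd.Series.nunique).astype(int)
print(result.tolist())  # [1, 2, 2, 3]
```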