Hello I Have a Df Look like that:
HostName Date
0 B 2021-01-01 12:42:00
1 B 2021-02-01 12:30:00
2 B 2021-02-01 12:40:00
3 B 2021-02-25 12:40:00
4 B 2021-03-01 12:41:00
5 B 2021-03-01 12:42:00
6 B 2021-03-02 12:43:00
7 B 2021-03-03 12:44:00
8 B 2021-04-04 12:44:00
9 B 2021-06-05 12:44:00
10 B 2021-08-06 12:44:00
11 B 2021-09-07 12:44:00
12 A 2021-03-12 12:45:00
13 A 2021-03-13 12:46:00
i what do to aggregation this is how I solved the problem but its not efficient at all and if there are 1M rows it will take a long time is there a better way to Aggregate efficiently between dates?
end results:
HostName Date ds
0 B 2021-01-01 12:42:00 1
1 B 2021-02-01 12:30:00 2
2 B 2021-02-01 12:40:00 3
3 B 2021-02-25 12:40:00 3
4 B 2021-03-01 12:41:00 2
5 B 2021-03-01 12:42:00 3
6 B 2021-03-02 12:43:00 4
7 B 2021-03-03 12:44:00 5
8 B 2021-04-04 12:44:00 1
9 B 2021-06-05 12:44:00 1
10 B 2021-08-06 12:44:00 1
11 B 2021-09-07 12:44:00 1
12 A 2021-03-12 12:45:00 1
13 A 2021-03-13 12:46:00 2
TheList = []
for index, row in df.iterrows():
TheList.append((df[(df['Date'] > (df['Date'].iloc[index] - pd.DateOffset(months=1))) & (df['Date'] <= df['Date'].iloc[index])].groupby(['HostName']).size()[row[0]]))
df['ds'] = TheList
is there is a better way to do it but with the same result?
CodePudding user response:
Here is used broadcasting between groups and for count True
s is used sum
in custom function in GroupBy.transform
:
Notice: Performance depends also by length of groups, if few very big groups here should be problem with memory.
df['Date'] = pd.to_datetime(df['Date'])
def f(x):
a = x.to_numpy()
b = x.sub(pd.DateOffset(months=1)).to_numpy()
return np.sum((a > b[:, None]) & (a <= a[:, None]), axis=1)
df['ds'] = df.groupby('HostName')['Date'].transform(f)
print (df)
HostName Date ds
0 B 2021-01-01 12:42:00 1
1 B 2021-02-01 12:30:00 2
2 B 2021-02-01 12:40:00 3
3 B 2021-02-25 12:40:00 3
4 B 2021-03-01 12:41:00 2
5 B 2021-03-01 12:42:00 3
6 B 2021-03-02 12:43:00 4
7 B 2021-03-03 12:44:00 5
8 B 2021-04-04 12:44:00 1
9 B 2021-06-05 12:44:00 1
10 B 2021-08-06 12:44:00 1
11 B 2021-09-07 12:44:00 1
12 A 2021-03-12 12:45:00 1
13 A 2021-03-13 12:46:00 2
Unfortunately need loops if memory problems:
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date']).sub(pd.DateOffset(months=1))
def f(x):
one = x['Date'].to_numpy()
both = x[['Date','Date1']].to_numpy()
x['ds'] = [np.sum((one > b) & (one <= a)) for a, b in both]
return x
df = df.groupby('HostName').apply(f)
print (df)
HostName Date Date1 ds
0 B 2021-01-01 12:42:00 2020-12-01 12:42:00 1
1 B 2021-02-01 12:30:00 2021-01-01 12:30:00 2
2 B 2021-02-01 12:40:00 2021-01-01 12:40:00 3
3 B 2021-02-25 12:40:00 2021-01-25 12:40:00 3
4 B 2021-03-01 12:41:00 2021-02-01 12:41:00 2
5 B 2021-03-01 12:42:00 2021-02-01 12:42:00 3
6 B 2021-03-02 12:43:00 2021-02-02 12:43:00 4
7 B 2021-03-03 12:44:00 2021-02-03 12:44:00 5
8 B 2021-04-04 12:44:00 2021-03-04 12:44:00 1
9 B 2021-06-05 12:44:00 2021-05-05 12:44:00 1
10 B 2021-08-06 12:44:00 2021-07-06 12:44:00 1
11 B 2021-09-07 12:44:00 2021-08-07 12:44:00 1
12 A 2021-03-12 12:45:00 2021-02-12 12:45:00 1
13 A 2021-03-13 12:46:00 2021-02-13 12:46:00 2