Aggregate efficiently between dates-CodePudding

Hello I Have a Df Look like that:

  HostName      Date        
0   B   2021-01-01 12:42:00
1   B   2021-02-01 12:30:00  
2   B   2021-02-01 12:40:00  
3   B   2021-02-25 12:40:00  
4   B   2021-03-01 12:41:00  
5   B   2021-03-01 12:42:00  
6   B   2021-03-02 12:43:00  
7   B   2021-03-03 12:44:00  
8   B   2021-04-04 12:44:00  
9   B   2021-06-05 12:44:00  
10  B   2021-08-06 12:44:00  
11  B   2021-09-07 12:44:00  
12  A   2021-03-12 12:45:00  
13  A   2021-03-13 12:46:00

i what do to aggregation this is how I solved the problem but its not efficient at all and if there are 1M rows it will take a long time is there a better way to Aggregate efficiently between dates?

end results:

  HostName      Date        ds
0   B   2021-01-01 12:42:00  1
1   B   2021-02-01 12:30:00  2
2   B   2021-02-01 12:40:00  3
3   B   2021-02-25 12:40:00  3
4   B   2021-03-01 12:41:00  2
5   B   2021-03-01 12:42:00  3
6   B   2021-03-02 12:43:00  4
7   B   2021-03-03 12:44:00  5
8   B   2021-04-04 12:44:00  1
9   B   2021-06-05 12:44:00  1
10  B   2021-08-06 12:44:00  1
11  B   2021-09-07 12:44:00  1
12  A   2021-03-12 12:45:00  1
13  A   2021-03-13 12:46:00  2

TheList = []
for index, row in df.iterrows():
    TheList.append((df[(df['Date'] > (df['Date'].iloc[index] - pd.DateOffset(months=1))) & (df['Date'] <= df['Date'].iloc[index])].groupby(['HostName']).size()[row[0]]))
df['ds'] = TheList

is there is a better way to do it but with the same result?

CodePudding user response：

Here is used broadcasting between groups and for count Trues is used sum in custom function in GroupBy.transform:

Notice: Performance depends also by length of groups, if few very big groups here should be problem with memory.

df['Date'] = pd.to_datetime(df['Date'])

def f(x):
    a = x.to_numpy()
    b = x.sub(pd.DateOffset(months=1)).to_numpy()
    return np.sum((a > b[:, None]) & (a <= a[:, None]), axis=1)

df['ds'] = df.groupby('HostName')['Date'].transform(f)

print (df)
   HostName                Date  ds
0         B 2021-01-01 12:42:00   1
1         B 2021-02-01 12:30:00   2
2         B 2021-02-01 12:40:00   3
3         B 2021-02-25 12:40:00   3
4         B 2021-03-01 12:41:00   2
5         B 2021-03-01 12:42:00   3
6         B 2021-03-02 12:43:00   4
7         B 2021-03-03 12:44:00   5
8         B 2021-04-04 12:44:00   1
9         B 2021-06-05 12:44:00   1
10        B 2021-08-06 12:44:00   1
11        B 2021-09-07 12:44:00   1
12        A 2021-03-12 12:45:00   1
13        A 2021-03-13 12:46:00   2

Unfortunately need loops if memory problems:

df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date']).sub(pd.DateOffset(months=1))

def f(x):
    one = x['Date'].to_numpy()
    both = x[['Date','Date1']].to_numpy()
    
    x['ds'] = [np.sum((one > b) & (one <= a))  for a, b in both]
    return x

df = df.groupby('HostName').apply(f)
print (df)
   HostName                Date               Date1  ds
0         B 2021-01-01 12:42:00 2020-12-01 12:42:00   1
1         B 2021-02-01 12:30:00 2021-01-01 12:30:00   2
2         B 2021-02-01 12:40:00 2021-01-01 12:40:00   3
3         B 2021-02-25 12:40:00 2021-01-25 12:40:00   3
4         B 2021-03-01 12:41:00 2021-02-01 12:41:00   2
5         B 2021-03-01 12:42:00 2021-02-01 12:42:00   3
6         B 2021-03-02 12:43:00 2021-02-02 12:43:00   4
7         B 2021-03-03 12:44:00 2021-02-03 12:44:00   5
8         B 2021-04-04 12:44:00 2021-03-04 12:44:00   1
9         B 2021-06-05 12:44:00 2021-05-05 12:44:00   1
10        B 2021-08-06 12:44:00 2021-07-06 12:44:00   1
11        B 2021-09-07 12:44:00 2021-08-07 12:44:00   1
12        A 2021-03-12 12:45:00 2021-02-12 12:45:00   1
13        A 2021-03-13 12:46:00 2021-02-13 12:46:00   2