Home > Mobile >  Aggregation based on previous month from eventdate
Aggregation based on previous month from eventdate

Time:12-09

I'm Stuck on a problem it would be great if you could help me :) I created a dataframe with pandas: looks like that:

HostName Date
A 2021-01-01 12:30
B 2021-01-01 12:42
B 2021-02-01 12:30
A 2021-02-01 12:40
A 2021-02-25 12:40
A 2021-03-01 12:41
A 2021-03-01 12:42

I try to Aggregat based on the previous month but it's not working. the end result should look like this:

HostName Date previous month
A 2021-01-01 12:30 Nan
B 2021-01-01 12:42 Nan
B 2021-02-01 12:30 1
A 2021-02-01 12:40 Nan
A 2021-02-25 12:40 1
A 2021-03-01 12:41 2
A 2021-03-01 12:42 3

for every row Date should look one-month before and Aggregat the number of Hostnames found. for example row number 6 count HostName A from 2021-02-01 12:41 to 2021-03-01 12:41

what I try to do and failed:

  1. extract the previous month:
 df['Date Before'] = df['Date'] - pd.DateOffset(months=1)
  1. and Aggregate between this month
 df.resample('M', on='Date').HostName.count()
 df.groupby('HostName').resample('M', on='Date Before').HostName.count()

Please Help Me many thanks!!!

CodePudding user response:

use shift to look back a n rows for a dataframe column. df is the group by results.

 data1="""HostName  Date
A   2021-01-01 12:30
B   2021-01-01 12:42
B   2021-02-01 12:30
A   2021-02-01 12:40
A   2021-02-25 12:40
A   2021-03-01 12:41
A   2021-03-01 12:42"""
    
df = pd.read_table(StringIO(data1), sep='\t')    
df['Date']=pd.to_datetime(df['Date'])

grouped=df.groupby('HostName')['Date']

def previous_date(group):
    return group.sort_values().shift(1)

df['Previous Date']=grouped.apply(previous_date)

df['Previous Count']=df.apply(lambda x: x['Date']-x['Previous Date'],axis=1)

print(df.sort_values(by=["HostName","Date"]))


df['Con'] = np.where( (df['Previous Date'].notnull() & df['Previous Count']>0) , 1, 0)

print(df.sort_values(by=["HostName","Date"]))

output:

     HostName                Date       Previous Date  Previous Count  Con
0        A 2021-01-01 12:30:00                 NaT             NaN    0
3        A 2021-02-01 12:40:00 2021-01-01 12:30:00            31.0    1
4        A 2021-02-25 12:40:00 2021-02-01 12:40:00            24.0    1
5        A 2021-03-01 12:41:00 2021-02-25 12:40:00             4.0    1
6        A 2021-03-01 12:42:00 2021-03-01 12:41:00             0.0    0
1        B 2021-01-01 12:42:00                 NaT             NaN    0
2        B 2021-02-01 12:30:00 2021-01-01 12:42:00            30.0    1

use cumsum to create a running total by hostname ​

CodePudding user response:

i found solution: original:

HostName    Date
0   A   2021-01-01 12:30:00
1   B   2021-01-01 12:42:00
2   B   2021-02-01 12:30:00
3   A   2021-02-01 12:40:00
4   A   2021-02-25 12:40:00
5   A   2021-03-01 12:41:00
6   A   2021-03-01 12:42:00

get month before

df['Month Before'] = df['Date'] - pd.DateOffset(months=1)

order datafarme

df = df.sort_values(['HostName','Date'])

shift by Host

df['prev_value'] = df.groupby('HostName')['Date'].shift()

checking

df['Con'] = np.where((df['Month Before'] <= df['prev_value']) | (df['prev_value'].notnull()) , 1, 0)

and group

gpc = df.groupby(['HostName','Con'])['HostName']
df['Count Per Host'] = gpc.cumcount() 

look like that

    HostName    Date    Month Before                 prev_value        Con   CountPerHost
0   A   2021-01-01 12:30:00 2020-12-01 12:30:00     NaT                 0      0
3   A   2021-02-01 12:40:00 2021-01-01 12:40:00 2021-01-01 12:30:00     0      0
4   A   2021-02-25 12:40:00 2021-01-25 12:40:00 2021-02-01 12:40:00     1      1
5   A   2021-03-01 12:41:00 2021-02-01 12:41:00 2021-02-25 12:40:00     1      2
6   A   2021-03-01 12:42:00 2021-02-01 12:42:00 2021-03-01 12:41:00     1      3
1   B   2021-01-01 12:42:00 2020-12-01 12:42:00     NaT                 0      0
2   B   2021-02-01 12:30:00 2021-01-01 12:30:00 2021-01-01 12:42:00     1     0
  • Related