I'm Stuck on a problem it would be great if you could help me :) I created a dataframe with pandas: looks like that:
HostName | Date |
---|---|
A | 2021-01-01 12:30 |
B | 2021-01-01 12:42 |
B | 2021-02-01 12:30 |
A | 2021-02-01 12:40 |
A | 2021-02-25 12:40 |
A | 2021-03-01 12:41 |
A | 2021-03-01 12:42 |
I try to Aggregat based on the previous month but it's not working. the end result should look like this:
HostName | Date | previous month |
---|---|---|
A | 2021-01-01 12:30 | Nan |
B | 2021-01-01 12:42 | Nan |
B | 2021-02-01 12:30 | 1 |
A | 2021-02-01 12:40 | Nan |
A | 2021-02-25 12:40 | 1 |
A | 2021-03-01 12:41 | 2 |
A | 2021-03-01 12:42 | 3 |
for every row Date should look one-month before and Aggregat the number of Hostnames found. for example row number 6 count HostName A from 2021-02-01 12:41 to 2021-03-01 12:41
what I try to do and failed:
- extract the previous month:
df['Date Before'] = df['Date'] - pd.DateOffset(months=1)
- and Aggregate between this month
df.resample('M', on='Date').HostName.count()
df.groupby('HostName').resample('M', on='Date Before').HostName.count()
Please Help Me many thanks!!!
CodePudding user response:
use shift to look back a n rows for a dataframe column. df is the group by results.
data1="""HostName Date
A 2021-01-01 12:30
B 2021-01-01 12:42
B 2021-02-01 12:30
A 2021-02-01 12:40
A 2021-02-25 12:40
A 2021-03-01 12:41
A 2021-03-01 12:42"""
df = pd.read_table(StringIO(data1), sep='\t')
df['Date']=pd.to_datetime(df['Date'])
grouped=df.groupby('HostName')['Date']
def previous_date(group):
return group.sort_values().shift(1)
df['Previous Date']=grouped.apply(previous_date)
df['Previous Count']=df.apply(lambda x: x['Date']-x['Previous Date'],axis=1)
print(df.sort_values(by=["HostName","Date"]))
df['Con'] = np.where( (df['Previous Date'].notnull() & df['Previous Count']>0) , 1, 0)
print(df.sort_values(by=["HostName","Date"]))
output:
HostName Date Previous Date Previous Count Con
0 A 2021-01-01 12:30:00 NaT NaN 0
3 A 2021-02-01 12:40:00 2021-01-01 12:30:00 31.0 1
4 A 2021-02-25 12:40:00 2021-02-01 12:40:00 24.0 1
5 A 2021-03-01 12:41:00 2021-02-25 12:40:00 4.0 1
6 A 2021-03-01 12:42:00 2021-03-01 12:41:00 0.0 0
1 B 2021-01-01 12:42:00 NaT NaN 0
2 B 2021-02-01 12:30:00 2021-01-01 12:42:00 30.0 1
use cumsum to create a running total by hostname
CodePudding user response:
i found solution: original:
HostName Date
0 A 2021-01-01 12:30:00
1 B 2021-01-01 12:42:00
2 B 2021-02-01 12:30:00
3 A 2021-02-01 12:40:00
4 A 2021-02-25 12:40:00
5 A 2021-03-01 12:41:00
6 A 2021-03-01 12:42:00
get month before
df['Month Before'] = df['Date'] - pd.DateOffset(months=1)
order datafarme
df = df.sort_values(['HostName','Date'])
shift by Host
df['prev_value'] = df.groupby('HostName')['Date'].shift()
checking
df['Con'] = np.where((df['Month Before'] <= df['prev_value']) | (df['prev_value'].notnull()) , 1, 0)
and group
gpc = df.groupby(['HostName','Con'])['HostName']
df['Count Per Host'] = gpc.cumcount()
look like that
HostName Date Month Before prev_value Con CountPerHost
0 A 2021-01-01 12:30:00 2020-12-01 12:30:00 NaT 0 0
3 A 2021-02-01 12:40:00 2021-01-01 12:40:00 2021-01-01 12:30:00 0 0
4 A 2021-02-25 12:40:00 2021-01-25 12:40:00 2021-02-01 12:40:00 1 1
5 A 2021-03-01 12:41:00 2021-02-01 12:41:00 2021-02-25 12:40:00 1 2
6 A 2021-03-01 12:42:00 2021-02-01 12:42:00 2021-03-01 12:41:00 1 3
1 B 2021-01-01 12:42:00 2020-12-01 12:42:00 NaT 0 0
2 B 2021-02-01 12:30:00 2021-01-01 12:30:00 2021-01-01 12:42:00 1 0