python pandas - optimal way to handle nested groupby

Time:10-25

I'm currently trying to process some log files using Python and the pandas library. The logs contain simple information about requests sent to the server, and I want to extract session information from them. A session here is defined as a set of requests made by the same user within a specific period of time (e.g. 30 minutes, counted from the time of the first request to the time of the last request; requests after this timeframe should be treated as part of a new session).

To do that, I currently perform nested grouping: first I use groupby to get the requests per user, then I group each user's requests into 30-minute intervals, and finally I iterate over those intervals and choose the ones that actually contain data:

    # example log entry:
    # id,host,time,method,url,response,bytes
    # 303372,XXX.XXX.XXX.XXX,1995-07-11 12:17:09,GET,/htbin/wais.com?IMAX,200,6923

    by_host = logs.groupby('host', sort=False)
    for host, frame in by_host:
        by_frame = frame.groupby(pd.Grouper(key='time', freq='30min', origin='start'))
        for date, logs in by_frame:
            if not logs.empty and logs.shape[0] > 1:
                session_calculations()

This is of course quite inefficient and makes the calculations take a considerable amount of time. Is there any way to optimize this process? I wasn't able to come up with anything successful.

edit:

                  host                time method                                           url  response  bytes
0          ***.novo.dk 1995-07-11 12:17:09    GET                                     /ksc.html       200   7067
1          ***.novo.dk 1995-07-11 12:17:48    GET               /shuttle/missions/missions.html       200   8678
2          ***.novo.dk 1995-07-11 12:23:10    GET     /shuttle/resources/orbiters/columbia.html       200   6922
3          ***.novo.dk 1995-08-09 12:48:48    GET  /shuttle/missions/sts-69/mission-sts-69.html       200  11264
4          ***.novo.dk 1995-08-09 12:49:48    GET               /shuttle/countdown/liftoff.html       200   4665

and expected result is a list of sessions extracted from requests:

          host session_time
0  ***.novo.dk     00:06:01
1  ***.novo.dk     00:01:00

Note that session_time here is the time difference between the first and last request in the input, after grouping the requests into 30-minute time windows.
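For example, the first three rows above fall into a single 30-minute window, so their session_time is just the spread of their timestamps:

```python
import pandas as pd

# the three requests from the first 30-minute window in the example above
t = pd.to_datetime([
    '1995-07-11 12:17:09',
    '1995-07-11 12:17:48',
    '1995-07-11 12:23:10',
])
session_time = t.max() - t.min()
print(session_time)  # 0 days 00:06:01
```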

CodePudding user response:

logs.groupby('host', sort=False).apply(
    lambda by_frame: by_frame.groupby(
        pd.Grouper(key='time', freq='30min', origin='start')
    ).apply(
        # session_calculations() is the placeholder from the question
        lambda logs: session_calculations() if (not logs.empty) and (logs.shape[0] > 1) else None
    )
)

CodePudding user response:

To define local time windows for each user, i.e. consider the origin as the time of the first request of each user, you can first group by 'host'. Then apply a function to each user's DataFrame, using GroupBy.apply, which handles the time grouping and computes the time duration of the user's sessions.

def session_duration_by_host(by_host):
    # 30-minute windows anchored at this host's first request
    time_grouper = pd.Grouper(key='time', freq='30min', origin='start')
    # session duration = spread between first and last request in the window
    duration = lambda time: time.max() - time.min()
    return (
        by_host.groupby(time_grouper)
               .agg(session_time=('time', duration))
    )

res = (
    logs.groupby("host")
        .apply(session_duration_by_host)
        .reset_index()
        .drop(columns="time")
)
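As a quick sanity check, here is the same approach applied to the five example rows from the question (a minimal, self-contained sketch; the extra `dropna` removes the empty 30-minute windows that `pd.Grouper` generates between the two sessions):

```python
import pandas as pd

def session_duration_by_host(by_host):
    # 30-minute windows anchored at this host's first request
    time_grouper = pd.Grouper(key='time', freq='30min', origin='start')
    duration = lambda time: time.max() - time.min()
    return by_host.groupby(time_grouper).agg(session_time=('time', duration))

# the five example rows from the question (only the relevant columns)
logs = pd.DataFrame({
    'host': ['***.novo.dk'] * 5,
    'time': pd.to_datetime([
        '1995-07-11 12:17:09', '1995-07-11 12:17:48', '1995-07-11 12:23:10',
        '1995-08-09 12:48:48', '1995-08-09 12:49:48',
    ]),
})

res = (
    logs.groupby('host')
        .apply(session_duration_by_host)
        .reset_index()
        .drop(columns='time')
        .dropna(subset=['session_time'])  # drop empty windows between sessions
        .reset_index(drop=True)
)
print(res)  # two sessions: 0 days 00:06:01 and 0 days 00:01:00
```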

