Pandas aggregate with self-written function: optimisation issue

Time:05-03

The following code does exactly what I need, but it is very slow on large amounts of data (up to 100,000 rows). How could it be improved?

import pandas as pd

df = pd.DataFrame({
        "session":["s1","s1","s1","s1","s2","s2","s2"], 
        "sub session":["a", "b", "d", "g", "f", "a", "x"],
        "time":["2022-01-04 10:00:00", "2022-01-04 10:01:00", "2022-01-04 10:10:00", "2022-01-04 10:12:00",
            "2022-01-04 15:25:00", "2022-01-04 15:30:00", "2022-01-04 15:45:00"]
        })

print(df)

  session sub session                 time
0      s1           a  2022-01-04 10:00:00
1      s1           b  2022-01-04 10:01:00
2      s1           d  2022-01-04 10:10:00
3      s1           g  2022-01-04 10:12:00
4      s2           f  2022-01-04 15:25:00
5      s2           a  2022-01-04 15:30:00
6      s2           x  2022-01-04 15:45:00


def func(serie):
    arr = serie.to_list()
    t0 = pd.to_datetime(str(arr[0]))
    return [(pd.to_datetime(str(i))-t0).total_seconds()/60 for i in arr]

res = df.groupby(['session']).agg(
            sub_session_path=("sub session", list),
            path_length=("sub session", 'count'),
            session_time=("time", func))

print(res)

        sub_session_path  path_length            session_time
session                                                      
s1          [a, b, d, g]            4  [0.0, 1.0, 10.0, 12.0]
s2             [f, a, x]            3        [0.0, 5.0, 20.0]

User response:

IIUC, convert the time column to datetime only once, up front, and use vectorized code in your function instead of parsing each value inside a Python loop:

df['time'] = pd.to_datetime(df['time'])

def func(s):
    return (s-s.iloc[0]).dt.total_seconds().div(60).round(2).to_list()

res = df.groupby(['session']).agg(
            sub_session_path=("sub session", list),
            path_length=("sub session", 'count'),
            session_time=("time", func))

output:

        sub_session_path  path_length            session_time
session                                                      
s1          [a, b, d, g]            4  [0.0, 1.0, 10.0, 12.0]
s2             [f, a, x]            3        [0.0, 5.0, 20.0]
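
If the custom function is still a bottleneck with many sessions, one possible variation is to drop it entirely and compute the elapsed minutes for the whole column in one pass before aggregating. This is only a sketch, not something benchmarked here. It assumes the rows are already ordered by time within each session and that no timestamps are missing (groupby "first" skips missing values, whereas s.iloc[0] does not). The helper column name elapsed_min is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "session": ["s1", "s1", "s1", "s1", "s2", "s2", "s2"],
    "sub session": ["a", "b", "d", "g", "f", "a", "x"],
    "time": ["2022-01-04 10:00:00", "2022-01-04 10:01:00",
             "2022-01-04 10:10:00", "2022-01-04 10:12:00",
             "2022-01-04 15:25:00", "2022-01-04 15:30:00",
             "2022-01-04 15:45:00"],
})

# Parse the timestamps once for the whole column.
df["time"] = pd.to_datetime(df["time"])

# Broadcast each session's first timestamp back onto its rows.
# Assumption: rows are sorted by time within each session.
first = df.groupby("session")["time"].transform("first")

# Minutes elapsed since the session start ("elapsed_min" is a hypothetical name).
df["elapsed_min"] = (df["time"] - first).dt.total_seconds() / 60

# Only built-in aggregations remain; no per-group datetime arithmetic.
res = df.groupby("session").agg(
    sub_session_path=("sub session", list),
    path_length=("sub session", "count"),
    session_time=("elapsed_min", list),
)

print(res)
```

On the sample data this should give the same three columns as above. Whether it is actually faster on your data is worth checking with a quick timing on a realistic sample.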