The following code does exactly what I need, but it is very slow with a large amount of data (up to 100,000 rows). How could it be improved?
import pandas as pd

df = pd.DataFrame({
    "session": ["s1", "s1", "s1", "s1", "s2", "s2", "s2"],
    "sub session": ["a", "b", "d", "g", "f", "a", "x"],
    "time": ["2022-01-04 10:00:00", "2022-01-04 10:01:00", "2022-01-04 10:10:00", "2022-01-04 10:12:00",
             "2022-01-04 15:25:00", "2022-01-04 15:30:00", "2022-01-04 15:45:00"]
})
print(df)
  session sub session                 time
0      s1           a  2022-01-04 10:00:00
1      s1           b  2022-01-04 10:01:00
2      s1           d  2022-01-04 10:10:00
3      s1           g  2022-01-04 10:12:00
4      s2           f  2022-01-04 15:25:00
5      s2           a  2022-01-04 15:30:00
6      s2           x  2022-01-04 15:45:00
def func(serie):
    # parses every value to datetime one element at a time, then computes
    # minutes elapsed since the first timestamp of the group
    arr = serie.to_list()
    t0 = pd.to_datetime(str(arr[0]))
    return [(pd.to_datetime(str(i)) - t0).total_seconds() / 60 for i in arr]
res = df.groupby(['session']).agg(
    sub_session_path=("sub session", list),
    path_length=("sub session", 'count'),
    session_time=("time", func))
print(res)
        sub_session_path  path_length            session_time
session
s1          [a, b, d, g]            4  [0.0, 1.0, 10.0, 12.0]
s2             [f, a, x]            3         [0.0, 5.0, 20.0]
CodePudding user response:
IIUC, initialize the time column as datetime only once and use vectorized code in your function:
df['time'] = pd.to_datetime(df['time'])
def func(s):
    # s is already datetime: subtract the group's first timestamp in one
    # vectorized operation instead of converting element by element
    return (s - s.iloc[0]).dt.total_seconds().div(60).round(2).to_list()
res = df.groupby(['session']).agg(
    sub_session_path=("sub session", list),
    path_length=("sub session", 'count'),
    session_time=("time", func))
output:
        sub_session_path  path_length            session_time
session
s1          [a, b, d, g]            4  [0.0, 1.0, 10.0, 12.0]
s2             [f, a, x]            3         [0.0, 5.0, 20.0]
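If the per-group Python function is still too slow at 100,000 rows, here is a further sketch (my addition, not part of the original answer; the elapsed column name is made up for illustration). It precomputes all offsets in a single vectorized pass with groupby.transform('first'), so .agg only has to collect lists:

df['time'] = pd.to_datetime(df['time'])
# subtract each session's first timestamp from every row of that session,
# computed for the whole frame at once via transform
df['elapsed'] = (df['time'] - df.groupby('session')['time'].transform('first')
                 ).dt.total_seconds().div(60)
res = df.groupby(['session']).agg(
    sub_session_path=("sub session", list),
    path_length=("sub session", 'count'),
    session_time=("elapsed", list))

The list aggregations run in Python either way, so the gain comes from doing the datetime arithmetic once over the full frame rather than once per group.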