I have a df ,you can have it by copy and run the following code:
import pandas as pd
from io import StringIO
df = """
b_id duration1 duration2 user cases
366 NaN 38 days 22:05:06.807430 Test 4
367 0 days 00:00:05.285239 NaN Test 4
368 NaN NaN Test 4
366 NaN 38 days 22:05:06.807430 Test 4
466 NaN 38 days 22:05:06.807430 Tom 3
467 0 days 00:00:05.285239 NaN Tom 3
467 0 days 00:00:05.285239 NaN Tom 3
"""
df= pd.read_csv(StringIO(df.strip()), sep='\s\s ', engine='python')
df
Then I use the following code to get the mean value of duration1 and duration2 :
out = (df
.set_index('user')
.filter(like='duration')
.apply(pd.to_timedelta)
.groupby(level=0).mean()
.reset_index()
)
Output:
user duration1 duration2
0 Test 0 days 00:00:05.285239 38 days 22:05:06.807430
1 Tom 0 days 00:00:05.285239 38 days 22:05:06.807430
My question is how to keep the column 'cases' in the output.The correct output should looks like:
user duration1 duration2 cases
0 Test 0 days 00:00:05.285239 38 days 22:05:06.807430 4
1 Tom 0 days 00:00:05.285239 38 days 22:05:06.807430 3
CodePudding user response:
The most straightforward in my opinion is to use both columns as index/grouper. This way you still benefit from the easy conversion to timedelta.
out = (df
.set_index(['user', 'cases'])
.filter(like='duration')
.apply(pd.to_timedelta)
.groupby(level=[0,1]).mean()
.reset_index()
)
Output:
user cases duration1 duration2
0 Test 4 0 days 00:00:05.285239 38 days 22:05:06.807430
1 Tom 3 0 days 00:00:05.285239 38 days 22:05:06.807430