Convert hours-minutes-seconds duration in dataframe to minutes-CodePudding

I have a dataframe with a column that represents time durations of two discrete events.

    Day             Duration
0   Mon    S: 3h0s, P: 18m0s
1  Tues    S: 3h0s, P: 18m0s
2   Wed    S: 4h0s, P: 18m0s
3  Thur   S: 30h0s, P: 10m0s
4   Fri      S: 15m, P: 3h0s

I want to split that duration into two distinct columns and consistently represent the time in minutes. Right now, it is shown in hours, minutes, and seconds, like S: 3h0s, P: 18m0s. So the output should look like this:

    Day             Duration   S(min) P(min)
0   Mon    S: 3h0s, P: 18m0s     180     18
1  Tues    S: 3h0s, P: 18m0s     180     18
2   Wed    S: 4h0s, P: 18m0s     240     18
3  Thur   S: 30h0s, P: 10m0s    1800     10
4   Fri      S: 15m, P: 3h0s      15    180

But when I do in str.replace

import pandas as pd

df = pd.read_csv("/file.csv")
df["S(min)"] = df['Duration'].str.split(',').str[0]
df["P(min)"] = df['Duration'].str.split(',').str[-1]

df['S(min)'] = df['S(min)'].str.replace("S: ", '').str.replace("h", '*60').str.replace('m','*1').str.replace('s','*(1/60)').apply(eval)
df['P(min)'] = df['P(min)'].str.replace("P: ", '').str.replace("h", '*60').str.replace('m','*1').str.replace('s','*(1/60)').apply(eval)

some of the calculations are off:

     Day           Duration  S(min)     P(min)
0    Mon  S: 3h0s, P: 18m0s    30.0   3.000000
1   Tues  S: 3h0s, P: 18m0s    30.0   3.000000
2    Wed  S: 4h0s, P: 18m0s    40.0   3.000000
3  Thurs   S: 30h, P: 10m0s  1800.0   1.666667
4    Fri    S: 15m, P: 3h0s    15.0  30.000000

CodePudding user response：

A possible solution:

df.assign(
    **df['Duration'].str.split(':|,', expand=True)[[1,3]]
    .apply(pd.to_timedelta)
    .apply(lambda x: x.dt.total_seconds().div(60))
    .rename({1: 'S(min)', 3: 'P(min)'}, axis=1))

Output:

    Day            Duration  S(min)  P(min)
0   Mon   S: 3h0s, P: 18m0s   180.0    18.0
1  Tues   S: 3h0s, P: 18m0s   180.0    18.0
2   Wed   S: 4h0s, P: 18m0s   240.0    18.0
3  Thur  S: 30h0s, P: 10m0s  1800.0    10.0
4   Fri     S: 15m, P: 3h0s    15.0   180.0

CodePudding user response：

An alternative approach using regex

df[['S', 'P']] = df['Duration'].str.extract(r'(S: .*?), P:( .*)')
df['S(min)'] = pd.to_timedelta(df['Duration'].str.replace("S: ", '').str.replace("P: ", '').str.split(',').str[0]).dt.total_seconds() / 60
df['P(min)'] = pd.to_timedelta(df['Duration'].str.replace("S: ", '').str.replace("P: ", '').str.split(',').str[-1]).dt.total_seconds() / 60
df.drop(['S', 'P'], axis=1, inplace=True)
print(df)

    Day            Duration  S(min)  P(min)
0   Mon   S: 3h0s, P: 18m0s   180.0    18.0
1  Tues   S: 3h0s, P: 18m0s   180.0    18.0
2   Wed   S: 4h0s, P: 18m0s   240.0    18.0
3  Thur  S: 30h0s, P: 10m0s  1800.0    10.0
4   Fri     S: 15m, P: 3h0s    15.0   180.0

CodePudding user response：

You forgot ' ' in '*60' and '*1' but this does not solve the problem if hour or minute are missing. You can also use a more complex regex to explode all values in different columns:

S_pat = r'S:\s*(?:(?P<S_h>\d )(?=h)h)?\s*(?:(?P<S_m>\d )(?=m)m)?\s*(?:(?P<S_s>\d )(?=s)s)?'
P_pat = r'P:\s*(?:(?P<P_h>\d )(?=h)h)?\s*(?:(?P<P_m>\d )(?=m)m)?\s*(?:(?P<P_s>\d )(?=s)s)?'

duration = df['Duration'].str.extract(SP_pat).fillna(0).astype(int)
duration = (duration.mul([60, 1, 1/60, 60, 1, 1/60])
                    .groupby(duration.columns.str[0], axis=1)
                    .sum().add_suffix('(min)'))

out = pd.concat([df, duration], axis=1)

Output:

>>> out
    Day            Duration  P(min)  S(min)
0   Mon   S: 3h0s, P: 18m0s    18.0   180.0
1  Tues   S: 3h0s, P: 18m0s    18.0   180.0
2   Wed   S: 4h0s, P: 18m0s    18.0   240.0
3  Thur  S: 30h0s, P: 10m0s    10.0  1800.0
4   Fri     S: 15m, P: 3h0s   180.0    15.0

# duration after extraction, before computing
>>> duration
  S_h S_m S_s P_h P_m P_s
0   3   0   0   0  18   0
1   3   0   0   0  18   0
2   4   0   0   0  18   0
3  30   0   0   0  10   0
4   0  15   0   3   0   0