I have a dataframe with a column that represents time durations of two discrete events.
Day Duration
0 Mon S: 3h0s, P: 18m0s
1 Tues S: 3h0s, P: 18m0s
2 Wed S: 4h0s, P: 18m0s
3 Thur S: 30h0s, P: 10m0s
4 Fri S: 15m, P: 3h0s
I want to split that duration into two distinct columns and consistently represent the time in minutes
. Right now, it is shown in hours
, minutes
, and seconds
, like S: 3h0s, P: 18m0s
. So the output should look like this:
Day Duration S(min) P(min)
0 Mon S: 3h0s, P: 18m0s 180 18
1 Tues S: 3h0s, P: 18m0s 180 18
2 Wed S: 4h0s, P: 18m0s 240 18
3 Thur S: 30h0s, P: 10m0s 1800 10
4 Fri S: 15m, P: 3h0s 15 180
But when I do in str.replace
import pandas as pd
df = pd.read_csv("/file.csv")
df["S(min)"] = df['Duration'].str.split(',').str[0]
df["P(min)"] = df['Duration'].str.split(',').str[-1]
df['S(min)'] = df['S(min)'].str.replace("S: ", '').str.replace("h", '*60').str.replace('m','*1').str.replace('s','*(1/60)').apply(eval)
df['P(min)'] = df['P(min)'].str.replace("P: ", '').str.replace("h", '*60').str.replace('m','*1').str.replace('s','*(1/60)').apply(eval)
some of the calculations are off:
Day Duration S(min) P(min)
0 Mon S: 3h0s, P: 18m0s 30.0 3.000000
1 Tues S: 3h0s, P: 18m0s 30.0 3.000000
2 Wed S: 4h0s, P: 18m0s 40.0 3.000000
3 Thurs S: 30h, P: 10m0s 1800.0 1.666667
4 Fri S: 15m, P: 3h0s 15.0 30.000000
CodePudding user response:
A possible solution:
df.assign(
**df['Duration'].str.split(':|,', expand=True)[[1,3]]
.apply(pd.to_timedelta)
.apply(lambda x: x.dt.total_seconds().div(60))
.rename({1: 'S(min)', 3: 'P(min)'}, axis=1))
Output:
Day Duration S(min) P(min)
0 Mon S: 3h0s, P: 18m0s 180.0 18.0
1 Tues S: 3h0s, P: 18m0s 180.0 18.0
2 Wed S: 4h0s, P: 18m0s 240.0 18.0
3 Thur S: 30h0s, P: 10m0s 1800.0 10.0
4 Fri S: 15m, P: 3h0s 15.0 180.0
CodePudding user response:
An alternative approach using regex
df[['S', 'P']] = df['Duration'].str.extract(r'(S: .*?), P:( .*)')
df['S(min)'] = pd.to_timedelta(df['Duration'].str.replace("S: ", '').str.replace("P: ", '').str.split(',').str[0]).dt.total_seconds() / 60
df['P(min)'] = pd.to_timedelta(df['Duration'].str.replace("S: ", '').str.replace("P: ", '').str.split(',').str[-1]).dt.total_seconds() / 60
df.drop(['S', 'P'], axis=1, inplace=True)
print(df)
Day Duration S(min) P(min)
0 Mon S: 3h0s, P: 18m0s 180.0 18.0
1 Tues S: 3h0s, P: 18m0s 180.0 18.0
2 Wed S: 4h0s, P: 18m0s 240.0 18.0
3 Thur S: 30h0s, P: 10m0s 1800.0 10.0
4 Fri S: 15m, P: 3h0s 15.0 180.0
CodePudding user response:
You forgot ' ' in '*60' and '*1' but this does not solve the problem if hour or minute are missing. You can also use a more complex regex to explode all values in different columns:
S_pat = r'S:\s*(?:(?P<S_h>\d )(?=h)h)?\s*(?:(?P<S_m>\d )(?=m)m)?\s*(?:(?P<S_s>\d )(?=s)s)?'
P_pat = r'P:\s*(?:(?P<P_h>\d )(?=h)h)?\s*(?:(?P<P_m>\d )(?=m)m)?\s*(?:(?P<P_s>\d )(?=s)s)?'
duration = df['Duration'].str.extract(SP_pat).fillna(0).astype(int)
duration = (duration.mul([60, 1, 1/60, 60, 1, 1/60])
.groupby(duration.columns.str[0], axis=1)
.sum().add_suffix('(min)'))
out = pd.concat([df, duration], axis=1)
Output:
>>> out
Day Duration P(min) S(min)
0 Mon S: 3h0s, P: 18m0s 18.0 180.0
1 Tues S: 3h0s, P: 18m0s 18.0 180.0
2 Wed S: 4h0s, P: 18m0s 18.0 240.0
3 Thur S: 30h0s, P: 10m0s 10.0 1800.0
4 Fri S: 15m, P: 3h0s 180.0 15.0
# duration after extraction, before computing
>>> duration
S_h S_m S_s P_h P_m P_s
0 3 0 0 0 18 0
1 3 0 0 0 18 0
2 4 0 0 0 18 0
3 30 0 0 0 10 0
4 0 15 0 3 0 0