I have a df and a column with strings that looks like following:
runtime
1h 38m
20h 4m
5h
45m
empty
and I am trying to apply a function which will convert it to minutes.
So far, I have come up with part of it:
def runtime_to_minutes(string):
try:
capt_numbers = re.compile(r'[\d ][\d ]')
hours = int(re.findall(capt_numbers, string)[0])
minutes = int(re.findall(capt_numbers, string)[1])
duration = hours * 60 minutes
return duration
except Exception as error:
return str(error)
which obviously cannot handle all the situations, although it won't work for '1h 38m' either as I get an error list index out of range
when I do: df['minutes'] = df['runtime'].apply(lambda s: runtime_to_minutes(s))
How should I restructure the regex and the function to get the desired result?
CodePudding user response:
You can use
import pandas as pd
df = pd.DataFrame({'runtime':['1h 38m','20h 4m','5h','45m','empty']})
df[['hours', 'minutes']] = df['runtime'].str.extract(r'(?=\d \s*[hm]\b)(?:(\d )\s*h)?(?:\s*(\d )\s*m)?').fillna(0)
df['minutes'] = df['hours'].astype(int) * 60 df['minutes'].astype(int)
df.drop('hours', axis=1, inplace=True)
# => df
# runtime minutes
# 0 1h 38m 98
# 1 20h 4m 1204
# 2 5h 300
# 3 45m 45
# 4 empty 0
See the regex demo. The pattern extracts two captures, hours and minutes. Both parts are optional, but the lookahead makes sure at least one part is present.
(?=\d \s*[hm]\b)
- a positive lookahead that requires one or more digits, zero or more whitespaces, and thenh
orm
not followed with any other word char(?:(\d )\s*h)?
- an optional non-capturing group capturing one or more digits into Group 1, and then just matching zero or more whitespaces andh
(?:\s*(\d )\s*m)?
- an optional non-capturing group matching zero or more whitespaces, then capturing one or more digits into Group 2, and then zero or more whitespaces andm
are matched.
If no match occurs, .fillna(0)
puts 0
as default value.
The hours and minutes are saved in hours
and minutes
columns.
Then, the minutes are calculated and hours
column is dropped.
CodePudding user response:
def runtime_to_minutes(string):
duration = 0
capt_numbers = re.compile(r'(\d )\s*h')
try:
hours = int(re.findall(capt_numbers, string)[0])
except:
hours=0
capt_numbers = re.compile(r'(\d )\s*m')
try:
minutes = int(re.findall(capt_numbers, string)[0])
except:
minutes = 0
duration = int(hours) * 60 int(minutes)
return duration
liste = [' 1h 38m',
' 20h 4m',
' 5h',
' 45m', 'empty']
for li in liste:
time = runtime_to_minutes(li)
print(time)
Results:
98
1204
300
45
0
I changed your function a little bit, now everything is working. I used 2 try/except blocks and set the number to zero in case of an Exception!
I forgot to add:
It will also work with pandas:
df = pd.DataFrame(liste)
df.columns = ['time']
df['time_2'] = df.time.apply(runtime_to_minutes)
time time_2
0 1h 38m 98
1 20h 4m 1204
2 5h 300
3 45m 45
4 empty 0