Home > Software engineering >  Regex expression to apply it on dataframe (convert a string of hour and minutes to sum of minutes) -
Regex expression to apply it on dataframe (convert a string of hour and minutes to sum of minutes) -

Time:11-05

I have a df and a column with strings that looks like following:

runtime
 1h 38m
 20h 4m
     5h
    45m
  empty

and I am trying to apply a function which will convert it to minutes.

So far, I have come up with part of it:

def runtime_to_minutes(string):
    
    try:
        capt_numbers = re.compile(r'[\d ][\d ]')
        hours = int(re.findall(capt_numbers, string)[0])
        minutes = int(re.findall(capt_numbers, string)[1])
        duration = hours * 60   minutes
        return duration
    except Exception as  error:
        return str(error)

which obviously cannot handle all the situations, although it won't work for '1h 38m' either as I get an error list index out of range when I do: df['minutes'] = df['runtime'].apply(lambda s: runtime_to_minutes(s))

How should I restructure the regex and the function to get the desired result?

CodePudding user response:

You can use

import pandas as pd
df = pd.DataFrame({'runtime':['1h 38m','20h 4m','5h','45m','empty']})
df[['hours', 'minutes']] = df['runtime'].str.extract(r'(?=\d \s*[hm]\b)(?:(\d )\s*h)?(?:\s*(\d )\s*m)?').fillna(0)
df['minutes'] = df['hours'].astype(int) * 60   df['minutes'].astype(int)
df.drop('hours', axis=1, inplace=True)
# => df
#    runtime  minutes
#  0  1h 38m       98
#  1  20h 4m     1204
#  2      5h      300
#  3     45m       45
#  4   empty        0

See the regex demo. The pattern extracts two captures, hours and minutes. Both parts are optional, but the lookahead makes sure at least one part is present.

  • (?=\d \s*[hm]\b) - a positive lookahead that requires one or more digits, zero or more whitespaces, and then h or m not followed with any other word char
  • (?:(\d )\s*h)? - an optional non-capturing group capturing one or more digits into Group 1, and then just matching zero or more whitespaces and h
  • (?:\s*(\d )\s*m)? - an optional non-capturing group matching zero or more whitespaces, then capturing one or more digits into Group 2, and then zero or more whitespaces and m are matched.

If no match occurs, .fillna(0) puts 0 as default value.

The hours and minutes are saved in hours and minutes columns.

Then, the minutes are calculated and hours column is dropped.

CodePudding user response:

def runtime_to_minutes(string):
    duration = 0
    capt_numbers = re.compile(r'(\d )\s*h')
    try:
        hours = int(re.findall(capt_numbers, string)[0])
    except:
        hours=0
    capt_numbers = re.compile(r'(\d )\s*m')
    try:
        minutes = int(re.findall(capt_numbers, string)[0])
    except:
        minutes = 0
    duration = int(hours) * 60   int(minutes)
    return duration

liste = [' 1h 38m',
' 20h 4m',
'     5h',
'    45m', 'empty']
for li in liste:
    time = runtime_to_minutes(li)
    print(time)


Results:
98
1204
300
45
0

I changed your function a little bit, now everything is working. I used 2 try/except blocks and set the number to zero in case of an Exception!

I forgot to add:

It will also work with pandas:

df = pd.DataFrame(liste)
df.columns = ['time']
df['time_2'] = df.time.apply(runtime_to_minutes)

      time  time_2
0   1h 38m      98
1   20h 4m    1204
2       5h     300
3      45m      45
4    empty       0
  • Related