How to clean a dataframe column with hours and minutes-CodePudding

I have a column of hours and minutes, and I would like all values in the column to be in hours. How do I divide only the values given in minutes by 60 to get hours? I tried splitting the column on the space to separate the numbers from the unit strings, but I got stuck on how to achieve the desired outcome.

CodePudding user response:

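Use apply with a lambda that splits each value on the space, divides the number by 60 when the unit is minutes, and otherwise keeps it as hours. Converting both branches to float keeps the column numeric instead of mixing strings and floats.

df["content_duration"] = df["content_duration"].apply(
    # "37 mins" -> 37 / 60 rounded to 2 places; "1.5 hours" -> 1.5
    lambda x: round(float(x.split(" ")[0]) / 60, 2) if x.split(" ")[1].startswith("min") else float(x.split(" ")[0])
)

print(df)

   content_duration
0              1.50
1              1.00
2              1.50
3              1.00
4              0.62
5              0.73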
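
As an alternative to the per-row lambda, the same idea can be sketched with vectorized string methods. This is only a minimal sketch, assuming the sample values from this thread; the new `hours` column and the names `parts`, `value`, `unit`, and `is_minutes` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data taken from the values shown in this thread.
df = pd.DataFrame(
    {"content_duration": ["1.5 hours", "1 hour", "1.5 hours", "1 hour", "37 mins", "44 mins"]}
)

# Split "37 mins" into a numeric part (column 0) and a unit part (column 1).
parts = df["content_duration"].str.split(" ", n=1, expand=True)
value = parts[0].astype(float)
unit = parts[1]

# Divide only the rows whose unit starts with "min"; hour rows are kept as they are.
is_minutes = unit.str.startswith("min")
df["hours"] = value.where(~is_minutes, value / 60).round(2)

print(df)
```

Writing to a separate column keeps the original strings available for checking the result; assigning back to `content_duration` would overwrite them, as in the answer above.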

CodePudding user response:

Use replace() with regex=True to turn each unit into an arithmetic suffix (/60 for minutes, *1 for hours). Then apply the pandas eval function to each value to evaluate the resulting expression, and round to the desired number of decimal places.

import pandas as pd

# Create the dataframe
df = pd.DataFrame({"content_duration": ['1.5 hours', '1 hour', '1.5 hours', '1 hour', '37 mins', '44 mins']})

# Convert each value to a number of hours
df['content_duration'] = (
    df['content_duration']
    .replace({' mins?': '/60', ' hours?': '*1'}, regex=True)  # "37 mins" -> "37/60"
    .apply(pd.eval)  # evaluate each expression string
    .round(1)
)

# Print the dataframe
print(df)

Output:

   content_duration
0               1.5
1               1.0
2               1.5
3               1.0
4               0.6
5               0.7
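
To see what the replace step produces before anything is evaluated, the two stages can be run separately. This is a small sketch under the assumption that the values look like the ones in this thread; the names `s`, `expressions`, and `hours` are illustrative. The `astype(float)` call is an addition here to make the result dtype explicit.

```python
import pandas as pd

s = pd.Series(["1.5 hours", "1 hour", "37 mins", "44 mins"])

# Stage 1: rewrite each unit as an arithmetic suffix; the values are still strings.
expressions = s.replace({" mins?": "/60", " hours?": "*1"}, regex=True)
print(expressions.tolist())  # ['1.5*1', '1*1', '37/60', '44/60']

# Stage 2: evaluate each expression string, then round to one decimal place.
hours = expressions.apply(pd.eval).astype(float).round(1)
print(hours.tolist())
```

Because every value is evaluated as an expression, this approach is best kept to data whose contents are known and trusted.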

CodePudding user response:

pandas' to_timedelta handles this kind of conversion well; you just need to remove the trailing s from hours/mins first:

df['hours'] = (pd.to_timedelta(df['content_duration']
                               .str.replace(r's\b', '', regex=True))
                .dt.total_seconds().div(3600)
                .round(2) # optional
              )

Output:

  content_duration  hours
0        1.5 hours   1.50
1           1 hour   1.00
2        1.5 hours   1.50
3           1 hour   1.00
4          37 mins   0.62
5          44 mins   0.73
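
The steps in that chain can also be looked at one at a time. The following is a minimal sketch, assuming input strings shaped like the ones in this thread; the names `s`, `singular`, `deltas`, and `hours` are illustrative.

```python
import pandas as pd

s = pd.Series(["1.5 hours", "1 hour", "37 mins"])

# Drop an "s" that sits at a word boundary: "hours" -> "hour", "mins" -> "min".
singular = s.str.replace(r"s\b", "", regex=True)
print(singular.tolist())  # ['1.5 hour', '1 hour', '37 min']

# Parse the cleaned strings into timedeltas, then express them as fractional hours.
deltas = pd.to_timedelta(singular)
hours = deltas.dt.total_seconds().div(3600).round(2)
print(hours.tolist())
```

Going through timedeltas means other units that to_timedelta understands would be converted by the same code, without a separate rule per unit.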

To get the result as strings with the unit appended:

df['hours'] = (pd.to_timedelta(df['content_duration'].str.replace(r's\b', '', regex=True))
                .dt.total_seconds().div(3600).round(2)
                .astype(str).add(' hours')
              )

Output:

  content_duration       hours
0        1.5 hours   1.5 hours
1           1 hour   1.0 hours
2        1.5 hours   1.5 hours
3           1 hour   1.0 hours
4          37 mins  0.62 hours
5          44 mins  0.73 hours