I have a dataframe with a column showing the time (in minutes) spent organizing each inventory item. The goal is to express the minutes spent as either an integer or a float. However, the values in this column are not clean; see the examples below. Is there a way to standardize and convert everything to an integer or float? (For example, 10 hours should become 600 minutes.)
import pandas as pd
df1 = { 'min':['420','450','480','512','560','10 hours', '10.5 hours',
'420 (all inventory)','3h ', '4.1 hours', '60**','6h', '7hours ']}
df1=pd.DataFrame(df1)
The desired output is like this
CodePudding user response:
I use regex for this kind of problem.
import re
import numpy as np
import pandas as pd
df1 = { 'min':['420','450','480','512','560','10 hours', '10.5 hours',
'420 (all inventory)','3h ', '4.1 hours', '60**','6h', '7hours ']}
df1=pd.DataFrame(df1)
# Empty NumPy array that collects the cleaned values by row position
arr = np.zeros(df1.shape[0])
# Iterate over a copy so the original DataFrame is left untouched
df1_copy = df1.copy()
for i, j in df1_copy.iterrows():
    if "h" in j["min"]:
        # Hour entries: strip letters, parentheses and whitespace, then convert to minutes
        hours = float(re.sub(r"[a-zA-Z()\s]", "", j["min"]))
        arr[i] = hours * 60
    else:
        # Minute entries: also strip asterisks before converting
        arr[i] = float(re.sub(r"[a-zA-Z()*\s]", "", j["min"]))
df1["min_clean"] = arr
print(df1)
                    min  min_clean
0                   420      420.0
1                   450      450.0
2                   480      480.0
3                   512      512.0
4                   560      560.0
5              10 hours      600.0
6            10.5 hours      630.0
7   420 (all inventory)      420.0
8                    3h      180.0
9             4.1 hours      246.0
10                 60**       60.0
11                   6h      360.0
12               7hours      420.0
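If the loop gets slow on a larger frame, a vectorized variant may be worth trying. This is only a sketch, assuming every entry starts with a number and that a unit beginning with "h" always means hours; the names `parts`, `value` and `is_hours` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'min': ['420', '450', '480', '512', '560', '10 hours', '10.5 hours',
                            '420 (all inventory)', '3h ', '4.1 hours', '60**', '6h', '7hours ']})

# Capture the leading number and, if present, a unit letter "h" right after it.
# Assumption: anything that is not followed by "h" is already in minutes.
parts = df1['min'].str.extract(r'(?i)^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>h)?')

# Convert the captured number to float; rows that did not match stay NaN.
value = pd.to_numeric(parts['value'])

# Rows where a unit was captured are hours and need to be scaled to minutes.
is_hours = parts['unit'].notna()
df1['min_clean'] = value.where(~is_hours, value * 60)

print(df1)
```

Unlike stripping characters, this only looks at the start of each string, so an entry that does not begin with a number ends up as NaN in `min_clean` instead of raising an error.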