I have a column in my pandas dataframe with the following values that represent hours worked in a week.
0 40
1 40h / week
2 46.25h/week on average
3 11
I would like to check every row and, if the value is longer than 2 characters, extract only the number of hours from it. I have tried the following:
df['Hours_per_week'].apply(lambda x: (x.extract('(\d+)') if(len(str(x)) > 2) else x))
However, I am getting this error: AttributeError: 'str' object has no attribute 'extract'.
CodePudding user response:
Assuming the series data are strings, try this:
df['Hours_per_week'].str.extract('(\d+)')
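A minimal sketch of why this works, assuming the sample values from the question are all stored as strings: `.str.extract` is a vectorised method on the Series, whereas inside `apply` each `x` is a plain Python `str`, which has no `extract` method. Note that `(\d+)` captures only the first run of digits, so `46.25` comes back as `46`.

```python
import pandas as pd

# Sample data assumed from the question; every value is a string.
df = pd.DataFrame(
    {"Hours_per_week": ["40", "40h / week", "46.25h/week on average", "11"]}
)

# expand=False returns a Series rather than a one-column DataFrame.
hours = df["Hours_per_week"].str.extract(r"(\d+)", expand=False)
print(hours.tolist())
```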
CodePudding user response:
Why not extract a float pattern directly, i.e. \d+\.?\d+?
>>> s = pd.Series(['40', '40h / week', '46.25h/week on average', '11'])
>>> s.str.extract("(\d+\.?\d+)")
       0
0     40
1     40
2  46.25
3     11
Two-digit values will still match either way.
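One caveat, sketched here with a hypothetical single-digit value (`"8h"`, not in the original data): the pattern `\d+\.?\d+` needs at least two digits, so a one-digit value would not match. If such values can occur, a variant with an optional fractional group may be safer. The conversion with `pd.to_numeric` is an added step for illustration.

```python
import pandas as pd

# "8h" is a hypothetical extra value added to show the single-digit case.
s = pd.Series(["40", "40h / week", "46.25h/week on average", "11", "8h"])

# Pattern from the answer above: requires two or more digits overall.
two_plus = s.str.extract(r"(\d+\.?\d+)", expand=False)

# Variant: one or more digits, optionally followed by a fractional part.
any_len = pd.to_numeric(s.str.extract(r"(\d+(?:\.\d+)?)", expand=False))

print(two_plus.tolist())
print(any_len.tolist())
```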
CodePudding user response:
It looks like you could require an h after the number:
df['Hours_per_week'].str.extract(r'(\d{2}\.?\d*)h', expand=False)
Output:
0      NaN
1       40
2    46.25
3      NaN
Name: Hours_per_week, dtype: object
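If the rows without an h should keep their original value instead of becoming NaN, one possible sketch is to fall back to the original column with `fillna` and then convert to numbers. The column name `Hours_clean` is hypothetical, and the sample data is assumed from the question.

```python
import pandas as pd

# Sample data assumed from the question; every value is a string.
df = pd.DataFrame(
    {"Hours_per_week": ["40", "40h / week", "46.25h/week on average", "11"]}
)

# Extract the number only where it is followed by "h"; other rows become NaN.
extracted = df["Hours_per_week"].str.extract(r"(\d{2}\.?\d*)h", expand=False)

# Fall back to the original value for unmatched rows, then convert to float.
# "Hours_clean" is a hypothetical column name used for illustration.
df["Hours_clean"] = pd.to_numeric(extracted.fillna(df["Hours_per_week"]))
print(df["Hours_clean"].tolist())
```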