Original code credit-Ken Jee
Salary = df['Salary Estimate'].apply(lambda x:x.split('(')[0])
minus_Kd = Salary.apply(lambda x:x.replace('K','').replace('$',''))
min_hr=minus_Kd.apply(lambda x:x.lower().replace('per hour;','').replace('employer
provided salary:',''))
df['min_salary'] = min_hr.apply(lambda x: x.split('-')[0])
df['max_salary'] = min_hr.apply(lambda x: x.split('-')[1])
df['avg_salary']=(df.min_salary df.max_salary)/2
Having the following error when running this portion of my code:
df['max_salary'] = min_hr.apply(lambda x: x.split('-')[1])
IndexError: list index out of range
Here is some sample data that I am attempting to parse through:
Employer Provided Salary:$78K - $191K
$77K - $107K (Glassdoor est.)
Being a zero index, shouldn`t 0 and 1 be left/right of the "-" respectively?
CodePudding user response:
Using .str accessor with extract, regex and eval:
Given df,
df = pd.DataFrame({'Job':['Job 1', 'Job 2', 'Job 3'],
'Salary':['$78K - $191K', '$77K - $107K', '$100K']})
Input df:
Job Salary
0 Job 1 $78K - $191K
1 Job 2 $77K - $107K
2 Job 3 $100K
Option 1
Using regex
df['Salary'].str.extract('\$(?P<lower>\d )(?:.*\$(?P<higher>\d )K)?').astype('float').eval('average = (lower higher)/2')
Output:
lower higher average
0 78.0 191.0 134.5
1 77.0 107.0 92.0
2 100.0 NaN NaN
Option 2:
Using list comprehension:
df['Min Salary'] = [int(x.split('-')[0].strip().strip('\$|K')) for x in df['Salary']]
df['Max Salary'] = [int(x.split('-')[-1].strip().strip('\$|K')) for x in df['Salary']]
df['Avg Salary'] = (df['Min Salary'] df['Max Salary'])/2
Output:
Job Salary Min Salary Max Salary Avg Salary
0 Job 1 $78K - $191K 78 191 134.5
1 Job 2 $77K - $107K 77 107 92.0
2 Job 3 $100K 100 100 100.0
Which leads me to the -1 index trick that way you don't have to worry about night having a second value if there isn't '-'.