Index issues while data cleaning-CodePudding

Original code credit-Ken Jee

Salary = df['Salary Estimate'].apply(lambda x:x.split('(')[0])
minus_Kd = Salary.apply(lambda x:x.replace('K','').replace('$',''))

min_hr=minus_Kd.apply(lambda x:x.lower().replace('per hour;','').replace('employer 
provided salary:',''))

df['min_salary'] = min_hr.apply(lambda x: x.split('-')[0])
df['max_salary'] = min_hr.apply(lambda x: x.split('-')[1])
df['avg_salary']=(df.min_salary df.max_salary)/2

Having the following error when running this portion of my code:

  df['max_salary'] = min_hr.apply(lambda x: x.split('-')[1])
  IndexError: list index out of range

Here is some sample data that I am attempting to parse through:

Employer Provided Salary:$78K - $191K

$77K - $107K (Glassdoor est.)

Being a zero index, shouldn`t 0 and 1 be left/right of the "-" respectively?

CodePudding user response：

Using .str accessor with extract, regex and eval:

Given df,

df = pd.DataFrame({'Job':['Job 1', 'Job 2', 'Job 3'],
                   'Salary':['$78K - $191K', '$77K - $107K', '$100K']})

Input df:

     Job        Salary
0  Job 1  $78K - $191K
1  Job 2  $77K - $107K
2  Job 3         $100K

Option 1

Using regex

df['Salary'].str.extract('\$(?P<lower>\d )(?:.*\$(?P<higher>\d )K)?').astype('float').eval('average = (lower higher)/2')

Output:

   lower  higher  average
0   78.0   191.0    134.5
1   77.0   107.0     92.0
2  100.0     NaN      NaN

Option 2:

Using list comprehension:

df['Min Salary'] = [int(x.split('-')[0].strip().strip('\$|K')) for x in df['Salary']]
df['Max Salary'] = [int(x.split('-')[-1].strip().strip('\$|K')) for x in df['Salary']]
df['Avg Salary'] = (df['Min Salary']   df['Max Salary'])/2

Output:

     Job        Salary  Min Salary  Max Salary  Avg Salary
0  Job 1  $78K - $191K          78         191       134.5
1  Job 2  $77K - $107K          77         107        92.0
2  Job 3         $100K         100         100       100.0

Which leads me to the -1 index trick that way you don't have to worry about night having a second value if there isn't '-'.