I have a large panel data set that includes job descriptions. I would like to extract the wages/salaries from the job descriptions. However, there is a lot of variability in how the salaries are stated in the job descriptions. Here are a few examples:
“The salary range in Colorado for this role is from USD $123,500 - $185,500”
“The salary for this role is $180,000 to $216,000”
“The salary for this role in the state of Colorado is between $150,800 to $226,000.”
“Pay Range: $12.00 - $16.00”
“Salary Range: $180,000 - $147,000”
“The anticipated starting base pay for this position is: $100,000 to $142,000 per year”
“Hourly wage estimate: $21.49 - $32.24 / hour”
In addition, sometimes the job description will include a "$" sign referring to a company's budget or market value, so it isn't as simple as just taking the information after the dollar sign.
I think the best way to go about this would be to use regular expressions. I think if I create a comprehensive set of key phrases (e.g., "The salary range for this role is", "Salary Range:", "Pay Range:", "The anticipated base pay for this position is:", etc.) that come before the salary information, I could then grab the pay information that comes after.
Here is the code I have come up with:
pattern = r'Starting\s+Pay\s+Range:|Salary\s+Range:|Pay\s+Range:'
pd_00['salary_info'] = pd_00['job_description'].str.extract(pattern, re.IGNORECASE, expand=False)
My issue is that I do not know the best way to go about pulling the salary information that comes after the set of key phrases. If you look at the above examples, sometimes the information has a "-" between the range, and sometimes it has a "to". Also, sometimes there are decimals in the dollar values, and sometimes there is no decimal. Any help would be greatly appreciated!
CodePudding user response:
Try this:
.*?(?:\b(?:[sS]alary|[wW]age|[pP]ay)\b).*?([$][\d,.]+\s*[ot-]+\s*[\d$,.]+\d+)
The salary range is the value captured by the first capturing group (\1).
See regex demo.
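To show how the capturing group can be read from Python, here is a minimal sketch using that pattern with a `+` quantifier after each character class (an assumption about the intended pattern). The names PAY_RANGE and find_pay_range are hypothetical and only serve the illustration.

```python
import re

# Pattern from the answer, assuming "+" quantifiers after each character class.
# Group 1 captures the range itself, e.g. "$12.00 - $16.00".
PAY_RANGE = re.compile(
    r".*?(?:\b(?:[sS]alary|[wW]age|[pP]ay)\b).*?"
    r"([$][\d,.]+\s*[ot-]+\s*[\d$,.]+\d+)"
)

def find_pay_range(text):
    """Return the text captured by group 1, or None when nothing matches."""
    match = PAY_RANGE.search(text)
    return match.group(1) if match else None

examples = [
    "Pay Range: $12.00 - $16.00",
    "The salary for this role is $180,000 to $216,000",
    "The budget this year will be from $150,000 to $200,000",
]
for text in examples:
    print(find_pay_range(text))
```

With pandas, the same pattern string should work in pd_00['job_description'].str.extract(PAY_RANGE.pattern, expand=False), since the pattern has exactly one capturing group. Note that the pattern only requires one of the keywords somewhere before the amounts, so a description that mentions "pay" and later states a budget range would still match.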
CodePudding user response:
Given that your question says the examples do not cover every possible text (for example, a sentence such as "The budget this year will be from $150,000 to $200,000" could also appear), a regex will, in my opinion, not be the best approach for this problem. As an NLP approach, you can use the transformers question-answering pipeline:
Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. Some question answering models can generate answers without context!
from transformers import pipeline
qa_model = pipeline("question-answering")
And by using the question:
question = "What is the pay or salary range for the role?"
You can then use the following texts:
texts = ["The salary range in Colorado for this role is from USD $123,500 - $185,500",
"The salary for this role is $180,000 to $216,000",
"The salary for this role in the state of Colorado is between $150,800 to $226,000.",
"Pay Range: $12.00 - $16.00",
"Salary Range: $180,000 - $147,000",
"The anticipated starting base pay for this position is: $100,000 to $142,000 per year",
"Hourly wage estimate: $21.49 - $32.24 / hour",
"The budget this year will be from $150,000 to $200,000"]
As input for the qa_model:
ranges = [qa_model(question = question, context = x)['answer'] for x in texts if qa_model(question = question, context = x)['score'] > 0.45]
To return:
['USD $123,500 - $185,500',
'$180,000 to $216,000',
'between $150,800 to $226,000',
'$12.00 - $16.00',
'$180,000 - $147,000',
'$100,000 to $142,000 per year',
'$21.49 - $32.24 / hour']
Note that I am using an arbitrary threshold of 0.45 to filter out sentences that contain ranges unrelated to pay or salary, such as the last element of the texts variable. Please adjust accordingly.
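The extracted answers are still strings. If numeric bounds are needed, one possible follow-up step (not part of the original answer) is to pull the dollar amounts out of each string. The following is a minimal sketch using only the standard library; AMOUNT and parse_range are hypothetical names, and sorting the two values is an assumption about how a reversed range such as '$180,000 - $147,000' should be treated.

```python
import re

# Matches one dollar amount such as "$123,500" or "$21.49" and captures the digits.
AMOUNT = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")

def parse_range(span):
    """Return (low, high) as floats, or None if fewer than two amounts are found."""
    values = [float(m.replace(",", "")) for m in AMOUNT.findall(span)]
    if len(values) < 2:
        return None
    # Assumption: a reversed range is a data-entry slip, so order the bounds.
    low, high = sorted(values[:2])
    return low, high

for span in ["USD $123,500 - $185,500", "$21.49 - $32.24 / hour", "$180,000 - $147,000"]:
    print(span, "->", parse_range(span))
```

Whether a range is hourly or yearly is not recovered here; that would need the surrounding words ("per year", "/ hour") or a rule based on the magnitude of the amounts.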
If you need to apply this to a dataframe (pd_00 here, as in the question) where the column "job_description" contains the text, you can try:
pd_00['salary_range'] = pd_00['job_description'].map(lambda x: qa_model(question = question, context = x)['answer'] if (qa_model(question = question, context = x)['score'] > 0.45) else "No pay range detected")
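The one-liner above runs the model twice per row, which can be slow on a large panel. A small helper that calls it once is sketched below. The name extract_pay_range is hypothetical, and the helper accepts any callable that takes question and context keyword arguments and returns a dict with 'answer' and 'score' keys, as the question-answering pipeline does.

```python
def extract_pay_range(text, qa, question, threshold=0.45):
    """Call the QA callable once and keep the answer only above the threshold."""
    result = qa(question=question, context=text)
    if result["score"] > threshold:
        return result["answer"]
    return "No pay range detected"
```

It could then be applied as pd_00['salary_range'] = pd_00['job_description'].map(lambda x: extract_pay_range(x, qa_model, question)). Missing or empty descriptions may need to be handled before the call.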