I have a data frame df which is a web scrapped data of Jobs and Job descriptions from glassdoor and it looks like this -
Job Title | Job Description |
---|---|
Data Scientist | Descripton |
Data Scientist | Descripton |
Data Engineer | Descripton |
Data Scientist | Descripton |
Data Analyst | Descripton |
It has columns for Job Title and Job Description. Job Descriptions are huge and have all the descriptions for a particular Job Title including the required skills somewhere in the middle of Job Description.
Screen Shot of one of the Job Description is below -
I am trying to get a dataframe with two columns, skills and frequency for the Job Title Data Scientist.
skills should be (list): Python, R, C/C , Java, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS.
Frequency: The number of times a skill is found in all Job descriptions for the Data Scientist Job Title
The output should be a dataframe that would look like this -
skills | frequency |
---|---|
python | 243 |
R | 109 |
power bi | 183 |
tableau | 201 |
and so on.
My take on this was:
(df
.loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
.str.split(',\s*')
.explode()
.value_counts()
)
But this gives the frequency of each word from the description.
Another way I tried:
test = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower()else 0)
test.value_counts()
This gives the count of skill but the problem with this approach is that the skills need to be fed one by one.
CodePudding user response:
I couldn't check the online file, but if I understand correctly you can use a regex to extract the terms;
l = ['Python', 'R', 'C/C ', 'Java', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'lLMS', 'SAS']
import re
regex = '|'.join(map(re.escape, l))
# 'Python|R|C/C\\ \\ |Java|Machine\\ Learning|Deep\\ Learning|NLP|Power\\ BI|Tableau|Excel|spark|aws|MS\\ Access|lLMS|SAS'
(df
.set_index('Job Title') # set index what you want to group on later
['Job Description']
.str.extractall(f'({regex})')[0]
.groupby(level=0)
.value_counts()
)