I have a data frame df.
It has columns for Job Title and Job Description. Job Description contains the full description for a particular Job Title, including the required skills. I am trying to get a dataframe with two columns, skills and frequency, for the role of 'Data Scientist'. Skills should be: Python, R, C/C++, Java, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS.
My take on this was:
(df
.loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
.str.split(r',\s*')
.explode()
.value_counts()
)
But this gives the frequency of every word in the description, not just the skills.
Another way I tried:
test = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
test.value_counts()
This gives the count, but each skill has to be fed in one by one.
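The one-skill-at-a-time idea above can be generalized with a loop over the skill list. A minimal sketch, using hypothetical sample data in place of the real df (counting how many 'Data Scientist' descriptions mention each skill):

```python
import pandas as pd

# Hypothetical sample data standing in for the real df
df = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist', 'Analyst'],
    'Job Description': [
        'Requires Python, SQL and Machine Learning',
        'Experience with Python, Tableau and AWS',
        'Excel and Power BI reporting',
    ],
})

skills = ['Python', 'Tableau', 'Machine Learning', 'AWS', 'Excel']

# Lower-case the Data Scientist descriptions once, up front
desc = df.loc[df['Job Title'].eq('Data Scientist'), 'Job Description'].str.lower()

# For each skill, count the descriptions that contain it (substring match)
freq = pd.DataFrame({
    'skill': skills,
    'frequency': [desc.str.contains(s.lower(), regex=False).sum() for s in skills],
})
print(freq)
```

This counts one occurrence per description; a substring match will still hit e.g. 'Excel' inside 'Excellent', so it is only a rough approach.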
CodePudding user response:
I couldn't check the online file, but if I understand correctly you can use a regex to extract the terms:
l = ['Python', 'R', 'C/C++', 'Java', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'LMS', 'SAS']
import re
regex = '|'.join(map(re.escape, l))
# 'Python|R|C/C\\+\\+|Java|Machine\\ Learning|Deep\\ Learning|NLP|Power\\ BI|Tableau|Excel|spark|aws|MS\\ Access|LMS|SAS'
(df
 .set_index('Job Title')  # set the index to what you want to group on later
['Job Description']
.str.extractall(f'({regex})')[0]
.groupby(level=0)
.value_counts()
)
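One caveat with the plain alternation: a short term like 'R' will match inside any word (e.g. 'research'), and 'Excel' will match inside 'excellent'. Wrapping the pattern in \b word boundaries and matching case-insensitively avoids that. A minimal sketch, again with hypothetical sample data in place of the real df:

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the real df
df = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist'],
    'Job Description': [
        'Strong Python and R; research background a plus',
        'Python, Spark, AWS and excellent Excel skills',
    ],
})

skills = ['Python', 'R', 'Spark', 'AWS', 'Excel']

# \b keeps 'R' from matching inside 'research'
# and 'Excel' from matching inside 'excellent'
regex = r'\b(' + '|'.join(map(re.escape, skills)) + r')\b'

counts = (df
 .set_index('Job Title')
 ['Job Description']
 .str.extractall(regex, flags=re.IGNORECASE)[0]
 .groupby(level=0)
 .value_counts()
)
print(counts)
```

Note that the counted values are the matched text as it appears in the descriptions, so if casing varies across postings you may want to normalise the matches before the value_counts.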