How to count frequencies of different words from a dataframe column when the words are provided in a-CodePudding

I have a data frame df which is a web scrapped data of Jobs and Job descriptions from glassdoor and it looks like this -

Job Title	Job Description
Data Scientist	Descripton
Data Scientist	Descripton
Data Engineer	Descripton
Data Scientist	Descripton
Data Analyst	Descripton

It has columns for Job Title and Job Description. Job Descriptions are huge and have all the descriptions for a particular Job Title including the required skills somewhere in the middle of Job Description.

Screen Shot of one of the Job Description is below -

I am trying to get a dataframe with two columns, skills and frequency for the Job Title Data Scientist.

skills should be (list): Python, R, C/C , Java, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS.

Frequency: The number of times a skill is found in all Job descriptions for the Data Scientist Job Title

The output should be a dataframe that would look like this -

skills	frequency
python	243
R	109
power bi	183
tableau	201

and so on.

My take on this was:

(df
 .loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
 .str.split(',\s*')
 .explode()
 .value_counts()
)

But this gives the frequency of each word from the description.

Another way I tried:

test = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower()else 0)
test.value_counts()

This gives the count of skill but the problem with this approach is that the skills need to be fed one by one.

CodePudding user response：

I couldn't check the online file, but if I understand correctly you can use a regex to extract the terms;

l = ['Python', 'R', 'C/C  ', 'Java', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'lLMS', 'SAS']

import re

regex = '|'.join(map(re.escape, l))
# 'Python|R|C/C\\ \\ |Java|Machine\\ Learning|Deep\\ Learning|NLP|Power\\ BI|Tableau|Excel|spark|aws|MS\\ Access|lLMS|SAS'

(df
 .set_index('Job Title') # set index what you want to group on later
 ['Job Description']
 .str.extractall(f'({regex})')[0]
 .groupby(level=0)
 .value_counts()
)