How to count frequencies of different words from a dataframe column when the words are provided in a list

Time:08-14

I have a data frame df of jobs and job descriptions scraped from Glassdoor, and it looks like this -

Job Title        Job Description
Data Scientist   Description
Data Scientist   Description
Data Engineer    Description
Data Scientist   Description
Data Analyst     Description

It has columns for Job Title and Job Description. The Job Descriptions are long, and the required skills appear somewhere in the middle of each description for a given Job Title.

(A screenshot of one of the Job Descriptions was attached to the original question; image omitted.)

I am trying to get a dataframe with two columns, skills and frequency for the Job Title Data Scientist.

skills should be (a list): Python, R, C/C++, Java, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS.

Frequency: the number of times a skill appears across all Job Descriptions for the Data Scientist Job Title.

The output should be a dataframe that would look like this -

skills     frequency
python     243
R          109
power bi   183
tableau    201
and so on.

My take on this was:

(df
 .loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
 .str.split(r',\s*')
 .explode()
 .value_counts()
)

But this gives the frequency of every word in the descriptions, not just the skills.

Another way I tried:

test = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
test.value_counts()

This gives the count for one skill, but the problem with this approach is that the skills have to be fed in one by one.
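The second attempt can be generalized by looping over the skills list instead of hard-coding one skill. A minimal sketch with toy data (the real df would come from the scrape); note that the `\b` word boundaries assumed here break for skills ending in non-word characters such as `C/C++`:

```python
import re
import pandas as pd

# Toy stand-in for the scraped frame (assumption: same column names as the question)
df = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist', 'Data Engineer'],
    'Job Description': ['Uses Python, R and Tableau daily', 'Python and SQL', 'Spark and AWS'],
})

skills = ['Python', 'R', 'Tableau']  # shortened toy list

# Descriptions for the target title, lower-cased once up front
desc = df.loc[df['Job Title'].eq('Data Scientist'), 'Job Description'].str.lower()

# Count whole-word occurrences of each skill across all matching descriptions
counts = {s: int(desc.str.count(rf'\b{re.escape(s.lower())}\b').sum()) for s in skills}
result = pd.DataFrame(list(counts.items()), columns=['skills', 'frequency'])
```

This keeps the per-skill logic of the second attempt but feeds the whole list in one pass.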

CodePudding user response:

I couldn't check the online file, but if I understand correctly you can use a regex to extract the terms:

l = ['Python', 'R', 'C/C++', 'Java', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'LMS', 'SAS']

import re

regex = '|'.join(map(re.escape, l))
# 'Python|R|C/C\\+\\+|Java|Machine Learning|Deep Learning|NLP|Power BI|Tableau|Excel|spark|aws|MS Access|LMS|SAS'

(df
 .set_index('Job Title')  # index by the column you want to group on later
 ['Job Description']
 .str.extractall(f'({regex})')[0]
 .groupby(level=0)
 .value_counts()
)