I have a data frame df.
It has columns for Job Title and Job Description. Job Description contains the full description for a particular Job Title, including the required skills. I am trying to get a dataframe with two columns, skills and frequency, for the role of 'Data Scientist'. Skills should be: Python, R, C/C++, Java, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS.
My take on this was:
(df
.loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
.str.split(r',\s*')
.explode()
.value_counts()
)
But this gives the frequency of every word in the description, not just the skills.
Another way I tried:
test = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
test.value_counts()
This gives the count, but each skill has to be fed in one by one.
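The one-skill-at-a-time idea above can be generalized with a loop over the skill list. A minimal sketch, using hypothetical sample data in place of the real df (counting how many 'Data Scientist' descriptions mention each skill):

```python
import pandas as pd

# Hypothetical sample data standing in for the real df
df = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist', 'Analyst'],
    'Job Description': [
        'Requires Python, SQL and Machine Learning',
        'Experience with Python, Tableau and AWS',
        'Excel and Power BI reporting',
    ],
})

skills = ['Python', 'Tableau', 'Machine Learning', 'AWS', 'Excel']

# Lower-case the Data Scientist descriptions once, up front
desc = df.loc[df['Job Title'].eq('Data Scientist'), 'Job Description'].str.lower()

# For each skill, count the descriptions that contain it (substring match)
freq = pd.DataFrame({
    'skill': skills,
    'frequency': [desc.str.contains(s.lower(), regex=False).sum() for s in skills],
})
print(freq)
```

This counts one occurrence per description; a substring match will still hit e.g. 'Excel' inside 'Excellent', so it is only a rough approach.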
CodePudding user response:
I couldn't check the online file, but if I understand correctly you can use a regex to extract the terms:
l = ['Python', 'R', 'C/C++', 'Java', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'LMS', 'SAS']
import re
regex = '|'.join(map(re.escape, l))
# 'Python|R|C/C\\+\\+|Java|Machine\\ Learning|Deep\\ Learning|NLP|Power\\ BI|Tableau|Excel|spark|aws|MS\\ Access|LMS|SAS'
(df
 .set_index('Job Title')  # set the index to what you want to group on later
['Job Description']
.str.extractall(f'({regex})')[0]
.groupby(level=0)
.value_counts()
)
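One caveat with the plain alternation: a short term like 'R' will match inside any word (e.g. 'research'), and 'Excel' will match inside 'excellent'. Wrapping the pattern in \b word boundaries and matching case-insensitively avoids that. A minimal sketch, again with hypothetical sample data in place of the real df:

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the real df
df = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist'],
    'Job Description': [
        'Strong Python and R; research background a plus',
        'Python, Spark, AWS and excellent Excel skills',
    ],
})

skills = ['Python', 'R', 'Spark', 'AWS', 'Excel']

# \b keeps 'R' from matching inside 'research'
# and 'Excel' from matching inside 'excellent'
regex = r'\b(' + '|'.join(map(re.escape, skills)) + r')\b'

counts = (df
 .set_index('Job Title')
 ['Job Description']
 .str.extractall(regex, flags=re.IGNORECASE)[0]
 .groupby(level=0)
 .value_counts()
)
print(counts)
```

Note that the counted values are the matched text as it appears in the descriptions, so if casing varies across postings you may want to normalise the matches before the value_counts.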