How to count frequency of different elements of a column with a condition of another column element-CodePudding

I Have a data frame ‘df’ which looks like the table below.

Job Title	Job Description
Data Scientist	xxxxx.., Python, R, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS, xxxx
Data Scientist	xxxxx..Python, R, Machine Learning, Deep Learning, NLP, Spark...xxxx
Data Analyst	xxxx...Python, SQL, Power BI, Tableau, ...xxx
Data Analyst	xxxx...Python, SQL, Power BI, Excel, ...xxx
...	...

I want to count the frequency of each skills (python, R, etc) for the job title 'Data Scientist. The result should look like - | Skills | Frequency | | -------- | -------------- | | python | 846 | | R | 165 | | ... | ... |

My attempt for this was -

def freq (skill):
    count = 0
    if df["Job Title"]=="Data Scientist":
        for i in df["Job Description"]:
            if i == skill:
                count = count   1
                return count

This gives error, even if it wouldn't then each skill needs to be fed one by one within the function. Is there any more pythonic way to do this?

CodePudding user response：

IIUC, split, explode and groupby.value_counts:

(df
 .assign(**{'Job Description': df['Job Description'].str.split(',\s*')})
 .explode('Job Description')
 .groupby('Job Title')['Job Description'].value_counts()
)

output:

Job Title       Job Description 
Data Analyst    ...xxx              2
                Power BI            2
                SQL                 2
                xxxx...Python       2
                Excel               1
                Tableau             1
Data Scientist  Deep Learning       2
                Machine Learning    2
                NLP                 2
                ...
                xxxxx..Python       1
Name: Job Description, dtype: int64

If you're just interested in 'Data Scientist', subset first:

(df
 .loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
 .str.split(',\s*')
 .explode()
 .value_counts()
)

output:

R                   2
Machine Learning    2
Deep Learning       2
NLP                 2
xxxxx..             1
MS Access           1
...
Power BI            1
Spark...xxxx        1
Name: Job Description, dtype: int64

CodePudding user response：

import pandas as pd

data = [
    ['Data Scientist', ['Python', 'R', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'LMS', 'SAS']],
    ['Data Scientist', ['Python', 'R', 'Machine Learning', 'Deep Learning', 'NLP', 'Spark']],
    ['Data Analyst', ['Python', 'SQL', 'Power BI', 'Tableau']],
    ['Data Analyst', ['Python', 'SQL', 'Power BI', 'Excel']]
]

df = pd.DataFrame(data, columns=['Job Title', 'Job Description'])

df_ds = df[df['Job Title'] == 'Data Scientist']
skills = set().union(*[set(s) for s in df_ds['Job Description']])
result = pd.DataFrame(0, index=skills, columns=['Frequency'])
for desc in df_ds['Job Description']:
    result.loc[desc, :]  = 1
print(result)

prints

index	Frequency
Machine Learning	2
R	2
spark	1
SAS	1
aws	1
MS Access	1
Spark	1
NLP	2
Tableau	1
Python	2
Deep Learning	2
LMS	1
Excel	1
Power BI	1