Home > Mobile >  How to count frequency of different elements of a column with a condition of another column element
How to count frequency of different elements of a column with a condition of another column element

Time:08-11

I Have a data frame ‘df’ which looks like the table below.

Job Title Job Description
Data Scientist xxxxx.., Python, R, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS, xxxx
Data Scientist xxxxx..Python, R, Machine Learning, Deep Learning, NLP, Spark...xxxx
Data Analyst xxxx...Python, SQL, Power BI, Tableau, ...xxx
Data Analyst xxxx...Python, SQL, Power BI, Excel, ...xxx
... ...

I want to count the frequency of each skills (python, R, etc) for the job title 'Data Scientist. The result should look like - | Skills | Frequency | | -------- | -------------- | | python | 846 | | R | 165 | | ... | ... |

My attempt for this was -

def freq (skill):
    count = 0
    if df["Job Title"]=="Data Scientist":
        for i in df["Job Description"]:
            if i == skill:
                count = count   1
                return count

This gives error, even if it wouldn't then each skill needs to be fed one by one within the function. Is there any more pythonic way to do this?

CodePudding user response:

IIUC, split, explode and groupby.value_counts:

(df
 .assign(**{'Job Description': df['Job Description'].str.split(',\s*')})
 .explode('Job Description')
 .groupby('Job Title')['Job Description'].value_counts()
)

output:

Job Title       Job Description 
Data Analyst    ...xxx              2
                Power BI            2
                SQL                 2
                xxxx...Python       2
                Excel               1
                Tableau             1
Data Scientist  Deep Learning       2
                Machine Learning    2
                NLP                 2
                ...
                xxxxx..Python       1
Name: Job Description, dtype: int64

If you're just interested in 'Data Scientist', subset first:

(df
 .loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
 .str.split(',\s*')
 .explode()
 .value_counts()
)

output:

R                   2
Machine Learning    2
Deep Learning       2
NLP                 2
xxxxx..             1
MS Access           1
...
Power BI            1
Spark...xxxx        1
Name: Job Description, dtype: int64

CodePudding user response:

import pandas as pd

data = [
    ['Data Scientist', ['Python', 'R', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'LMS', 'SAS']],
    ['Data Scientist', ['Python', 'R', 'Machine Learning', 'Deep Learning', 'NLP', 'Spark']],
    ['Data Analyst', ['Python', 'SQL', 'Power BI', 'Tableau']],
    ['Data Analyst', ['Python', 'SQL', 'Power BI', 'Excel']]
]

df = pd.DataFrame(data, columns=['Job Title', 'Job Description'])

df_ds = df[df['Job Title'] == 'Data Scientist']
skills = set().union(*[set(s) for s in df_ds['Job Description']])
result = pd.DataFrame(0, index=skills, columns=['Frequency'])
for desc in df_ds['Job Description']:
    result.loc[desc, :]  = 1
print(result)

prints

index Frequency
Machine Learning 2
R 2
spark 1
SAS 1
aws 1
MS Access 1
Spark 1
NLP 2
Tableau 1
Python 2
Deep Learning 2
LMS 1
Excel 1
Power BI 1
  • Related