I Have a data frame ‘df’ which looks like the table below.
Job Title | Job Description |
---|---|
Data Scientist | xxxxx.., Python, R, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS, xxxx |
Data Scientist | xxxxx..Python, R, Machine Learning, Deep Learning, NLP, Spark...xxxx |
Data Analyst | xxxx...Python, SQL, Power BI, Tableau, ...xxx |
Data Analyst | xxxx...Python, SQL, Power BI, Excel, ...xxx |
... | ... |
I want to count the frequency of each skills (python, R, etc) for the job title 'Data Scientist. The result should look like - | Skills | Frequency | | -------- | -------------- | | python | 846 | | R | 165 | | ... | ... |
My attempt for this was -
def freq (skill):
count = 0
if df["Job Title"]=="Data Scientist":
for i in df["Job Description"]:
if i == skill:
count = count 1
return count
This gives error, even if it wouldn't then each skill needs to be fed one by one within the function. Is there any more pythonic way to do this?
CodePudding user response:
IIUC, split
, explode
and groupby.value_counts
:
(df
.assign(**{'Job Description': df['Job Description'].str.split(',\s*')})
.explode('Job Description')
.groupby('Job Title')['Job Description'].value_counts()
)
output:
Job Title Job Description
Data Analyst ...xxx 2
Power BI 2
SQL 2
xxxx...Python 2
Excel 1
Tableau 1
Data Scientist Deep Learning 2
Machine Learning 2
NLP 2
...
xxxxx..Python 1
Name: Job Description, dtype: int64
If you're just interested in 'Data Scientist', subset first:
(df
.loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
.str.split(',\s*')
.explode()
.value_counts()
)
output:
R 2
Machine Learning 2
Deep Learning 2
NLP 2
xxxxx.. 1
MS Access 1
...
Power BI 1
Spark...xxxx 1
Name: Job Description, dtype: int64
CodePudding user response:
import pandas as pd
data = [
['Data Scientist', ['Python', 'R', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'LMS', 'SAS']],
['Data Scientist', ['Python', 'R', 'Machine Learning', 'Deep Learning', 'NLP', 'Spark']],
['Data Analyst', ['Python', 'SQL', 'Power BI', 'Tableau']],
['Data Analyst', ['Python', 'SQL', 'Power BI', 'Excel']]
]
df = pd.DataFrame(data, columns=['Job Title', 'Job Description'])
df_ds = df[df['Job Title'] == 'Data Scientist']
skills = set().union(*[set(s) for s in df_ds['Job Description']])
result = pd.DataFrame(0, index=skills, columns=['Frequency'])
for desc in df_ds['Job Description']:
result.loc[desc, :] = 1
print(result)
prints
index | Frequency |
---|---|
Machine Learning | 2 |
R | 2 |
spark | 1 |
SAS | 1 |
aws | 1 |
MS Access | 1 |
Spark | 1 |
NLP | 2 |
Tableau | 1 |
Python | 2 |
Deep Learning | 2 |
LMS | 1 |
Excel | 1 |
Power BI | 1 |