I have lists of keywords that I want to use to categorize job descriptions. Here is an example set of keyword lists:
manager = ["manager", "president", "management", "managing"]
assistant = ["assistant", "assisting"]
engineer = ["engineer", "engineering", "scientist", "architect"]
If a job description contains any keyword from the "manager" list, I would like to classify it as manager, regardless of whether it also contains keywords from the other lists. In other words, there is a hierarchy among the lists (manager > assistant > engineer). Example data frame:
job_description
Managing engineer is responsible for
This job entails assisting to
Engineer is required the execute
Pilot should be able to control
After the classification I would like to obtain the following data frame
job_description classification
Managing engineer is responsible for manager
This job entails assisting to engineers assistant
Engineer is required the execute engineer
Pilot should be able to control
I have more than three lists. Is there an efficient way to classify such text? Based on previous answers, I wrote the following code, but for some reason it does not work on my text:
import re
def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp = [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out
df2 = df[~df['job_description'].isna()]
lsts = [manager, assistant, engineer]
out = df2.join(pd.DataFrame(get_dummy_vars(df2['cleaned'], lsts), columns=["manager", "assistant", "engineer"]))
CodePudding user response:
First, I'd suggest storing your classifications in a dictionary, which makes it easier to retrieve the category name.
Then you can write a function that iterates over the dictionary items (the dict must therefore be ordered by priority; dicts preserve insertion order in Python 3.7+) and apply it to the job_description column:
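# Keys are ordered by priority: the first category that matches wins.
cat_dict = {
    "manager": ["manager", "president", "management", "managing"],
    "assistant": ["assistant", "assisting"],
    "engineer": ["engineer", "engineering", "scientist", "architect"]
}

def classify(desc):
    desc = desc.lower()  # lowercase once for case-insensitive substring matching
    for cat, lst in cat_dict.items():
        if any(x in desc for x in lst):
            return cat
    return None  # no keyword from any list matched

df['classification'] = df["job_description"].apply(classify)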
Output:
job_description classification
0 Managing engineer is responsible for manager
1 This job entails assisting to assistant
2 Engineer is required the execute engineer
3 Pilot should be able to control None
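Note that `x in desc` is a substring test, so "architect" would also match inside "architecture". If that matters, a whole-word variant along the lines of the `\b` regex from the question is possible. The following is only a sketch: the names `patterns` and `classify_whole_words` are made up for illustration, and it assumes matching should be case-insensitive (the question's regex is case-sensitive, so "Managing" would not match "managing").

```python
import re

# Categories in priority order (dicts preserve insertion order in Python 3.7+).
cat_dict = {
    "manager": ["manager", "president", "management", "managing"],
    "assistant": ["assistant", "assisting"],
    "engineer": ["engineer", "engineering", "scientist", "architect"],
}

# One compiled, case-insensitive pattern per category.
# \b on both sides restricts matches to whole words.
patterns = {
    cat: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    for cat, words in cat_dict.items()
}

def classify_whole_words(desc):
    """Return the first category whose keyword appears as a whole word, else None."""
    if not isinstance(desc, str):  # e.g. NaN in a pandas column
        return None
    for cat, pattern in patterns.items():
        if pattern.search(desc):
            return cat
    return None
```

It can be applied the same way, with df["job_description"].apply(classify_whole_words). The trade-off is that whole-word matching no longer catches inflected forms such as "engineers", so those would have to be added to the lists explicitly.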
CodePudding user response:
I would consider using nested np.where() calls with regex. '|'.join(manager) builds a pattern that matches when any of the keywords is present. You can try:
import numpy as np

df['classification'] = np.where(df['job_description'].str.contains('|'.join(manager), case=False), 'Manager',
                       np.where(df['job_description'].str.contains('|'.join(assistant), case=False), 'Assistant',
                       np.where(df['job_description'].str.contains('|'.join(engineer), case=False), "Engineer", "No Class")))
The order matters because it defines the hierarchy of the assignments: the first condition that matches wins.
Returning:
job_description classification
0 Managing engineer is responsible for Manager
1 This job entails assisting to Assistant
2 Engineer is required the execute Engineer
3 Pilot should be able to control No Class
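With more than three lists, the nesting gets hard to read. One possible alternative is np.select, which takes a list of conditions and returns the choice for the first condition that is true, so the priority order is kept. This is a sketch rather than a drop-in replacement: the `keyword_lists` dict and the sample frame are assumptions for illustration, and `na=False` is added so that missing descriptions count as non-matches.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({"job_description": [
    "Managing engineer is responsible for",
    "This job entails assisting to",
    "Engineer is required the execute",
    "Pilot should be able to control",
]})

# Keyword lists in priority order, highest first (hypothetical name).
keyword_lists = {
    "Manager": ["manager", "president", "management", "managing"],
    "Assistant": ["assistant", "assisting"],
    "Engineer": ["engineer", "engineering", "scientist", "architect"],
}

# One boolean mask per category; na=False treats missing text as "no match".
conditions = [
    df["job_description"].str.contains("|".join(words), case=False, na=False)
    for words in keyword_lists.values()
]

# np.select uses the first condition that is True for each row.
df["classification"] = np.select(conditions, list(keyword_lists), default="No Class")
```

Adding another category then only means adding one more entry to the dict at the right priority position.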