Searching for certain keywords in pandas dataframe for classification

I have lists of keywords that I want to use to categorize job descriptions. Here is an example of the keyword lists:

manager = ["manager", "president", "management", "managing"]
assistant = ["assistant", "assisting"]
engineer = ["engineer", "engineering", "scientist", "architect"]

If a job description contains any keyword from the "manager" list, I would like to classify it as manager, regardless of whether it also contains keywords from the other lists. So there is a hierarchy among the lists (manager > assistant > engineer). Example data frame:

job_description
Managing engineer is responsible for
This job entails assisting to
Engineer is required the execute
Pilot should be able to control

After the classification I would like to obtain the following data frame

job_description                              classification
Managing engineer is responsible for         manager
This job entails assisting to engineers      assistant
Engineer is required the execute             engineer
Pilot should be able to control

I have more than three lists. Is there an efficient way of classifying such text? Based on previous answers, I wrote the following code, but for some reason it does not work on this text:

import re

def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match 
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out

df2 = df[~df['job_description'].isna()]
lsts = [manager, assistant, engineer]
out = df2.join(pd.DataFrame(get_dummy_vars(df2['cleaned'], lsts), columns=["manager", "assistant", "engineer"]))

CodePudding user response：

First, I'd suggest storing your classifications in a dictionary, which makes it easy to retrieve the category name. Then you can write a function that iterates over the dictionary items (so the dict must be ordered by priority; dicts keep insertion order in Python 3.7+) and apply that function to the job_description column:

cat_dict = {
    "manager": ["manager", "president", "management", "managing"],
    "assistant": ["assistant", "assisting"],
    "engineer": ["engineer", "engineering", "scientist", "architect"]
}

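def classify(desc):
    # Return the first (highest-priority) category that has a keyword in the description.
    desc = desc.lower()
    for cat, lst in cat_dict.items():
        if any(x in desc for x in lst):
            return cat
    return None  # no keyword matched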

df['classification'] = df["job_description"].apply(classify)

Output:

                        job_description classification
0  Managing engineer is responsible for        manager
1         This job entails assisting to      assistant
2      Engineer is required the execute       engineer
3       Pilot should be able to control           None
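
One caveat with the `in` check above: it matches substrings, so a keyword such as "architect" would also match "architecture". If that matters, a whole-word regex per category is one possible variant. The following is only a sketch under assumptions: the names cat_patterns and classify_words are made up for illustration, and non-string values such as NaN are mapped to None.

```python
import re

import pandas as pd

# Categories listed from highest to lowest priority
# (dicts keep insertion order in Python 3.7+).
cat_dict = {
    "manager": ["manager", "president", "management", "managing"],
    "assistant": ["assistant", "assisting"],
    "engineer": ["engineer", "engineering", "scientist", "architect"],
}

# Hypothetical helper: one compiled, case-insensitive, whole-word pattern per category.
cat_patterns = {
    cat: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE)
    for cat, words in cat_dict.items()
}


def classify_words(desc):
    """Return the first category whose pattern matches, or None."""
    if not isinstance(desc, str):  # e.g. NaN in a pandas column
        return None
    for cat, pattern in cat_patterns.items():
        if pattern.search(desc):
            return cat
    return None


df = pd.DataFrame({"job_description": [
    "Managing engineer is responsible for",
    "This job entails assisting to",
    "Engineer is required the execute",
    "Pilot should be able to control",
]})
df["classification"] = df["job_description"].apply(classify_words)
```

Note that whole-word matching is stricter than the substring check: a plural such as "engineers" no longer matches "engineer" unless it is added to the keyword list.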

CodePudding user response：

I would consider using nested np.where() calls with a regex. '|'.join(manager) builds a pattern that matches when any of the keywords is present. You can try:

import numpy as np

df['classification'] = np.where(df['job_description'].str.contains('|'.join(manager),case=False),'Manager',
                                np.where(df['job_description'].str.contains('|'.join(assistant),case=False),'Assistant',
                                         np.where(df['job_description'].str.contains('|'.join(engineer),case=False),"Engineer","No Class")))

The order matters because it defines the hierarchy of the assignments: the first condition that matches wins.

Returning:

                        job_description classification
0  Managing engineer is responsible for        Manager
1         This job entails assisting to      Assistant
2      Engineer is required the execute       Engineer
3       Pilot should be able to control       No Class
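
Since the question mentions more than three keyword lists, nesting np.where() calls can get unwieldy. One possible alternative is np.select, which takes a list of conditions and a parallel list of choices and uses the first condition that is true for each row. The sketch below assumes the keyword lists from the question; the names categories and conditions, the re.escape call and the na=False handling are additions for illustration.

```python
import re

import numpy as np
import pandas as pd

# Keyword lists from the question, ordered from highest to lowest priority.
categories = {
    "Manager": ["manager", "president", "management", "managing"],
    "Assistant": ["assistant", "assisting"],
    "Engineer": ["engineer", "engineering", "scientist", "architect"],
}

df = pd.DataFrame({"job_description": [
    "Managing engineer is responsible for",
    "This job entails assisting to",
    "Engineer is required the execute",
    "Pilot should be able to control",
]})

# One boolean mask per category; na=False treats missing descriptions as non-matching.
conditions = [
    df["job_description"]
    .str.contains("|".join(map(re.escape, words)), case=False, na=False)
    .to_numpy(dtype=bool)
    for words in categories.values()
]

# np.select picks the choice of the first condition that is True for each row.
df["classification"] = np.select(conditions, list(categories), default="No Class")
```

Adding another category then only means adding one more entry to the dictionary at the right priority position.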