Looking up part of the text in a column and, if found, creating a new column reflecting the assigned category

I want to create categories based on whether a particular keyword string is found in the text: if it is, assign the matching category; otherwise assign "other".

For example, if "health" is found in a column, label that row "HEALTH"; if "therapist" is found, label it "THERAPIST".

Create a "category" column through code.
Assign categories based on a condition.

I was able to do this in Excel by creating a table and using INDEX/MATCH, and I want to switch to Python so I can apply it to a large dataset.

Below is the sample data:

keyword	category
HR Consultancy UK-d-uk-159_bing	other
it support COMPANY LONDON-D-UK-G1161_bing	other
global sales training platform openings	sales
tele private practice therapist	therapist
asset grant management system	other
digital team project management solution openings	other
global training platform openings	other
tele practice therapist	therapist
global sales training platform openings	sales
tele health practice	health
asset grant management	other
digital team project management solution	other

CodePudding user response：

Pandas has an apply method for this:

def f(s):
    # str.find returns -1 when the substring is absent
    if s.find('health') >= 0:
        return 'health'
    if s.find('thera...') >= 0:
        return 'thera'
    ...

df['category'] = df['keyword'].apply(f)

Read: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
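A fuller version of the apply approach might look like the sketch below. The RULES dictionary, the categorize helper, and the "other" default are illustrative assumptions, not part of the answer above; the rows are taken from the sample data in the question.

```python
import pandas as pd

# Hypothetical keyword -> category rules, checked in order; the first match wins.
RULES = {"health": "health", "therapist": "therapist", "sales": "sales"}


def categorize(text, rules=RULES, default="other"):
    """Return the category of the first rule keyword found in text."""
    lowered = str(text).lower()  # make the substring test case-insensitive
    for needle, category in rules.items():
        if needle in lowered:
            return category
    return default


df = pd.DataFrame({"keyword": [
    "HR Consultancy UK-d-uk-159_bing",
    "global sales training platform openings",
    "tele private practice therapist",
    "tele health practice",
]})
df["category"] = df["keyword"].apply(categorize)
print(df)
```

Because apply calls a Python function once per row, this is easy to read but can be slower than the vectorized str methods on very large datasets.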

Pandas also has a map method. First create a dictionary, then pass it to map.

d = {keyword1: category1, ...}
df['category'] = df['keyword'].map(d)

This works when each cell value matches a dictionary key exactly; map does not match substrings. Read: https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html
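To make the exact-match behaviour of map concrete, here is a small sketch. The lookup dictionary and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical lookup table: whole cell value -> category.
lookup = {"health": "health", "therapist": "therapist"}

s = pd.Series(["health", "tele health practice", "therapist"])

# map looks each value up as a whole; values that are not keys become NaN,
# which fillna then turns into the default label.
category = s.map(lookup).fillna("other")
print(category.tolist())
```

The middle value contains "health" but is not itself a key, so it ends up as "other". For substring matching, a str-based method is the better fit.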

CodePudding user response：

You can first create a list of keywords to search for, then fill the rows that have no match with "other".

import pandas as pd

# Example df
df = pd.DataFrame(data=['health xxxxx', 'yyyy therapist', 'zzzzz'], columns=["keyword"])

keywords = ['health', 'therapist']

# Collect every listed keyword found in each row, de-duplicate, and join with commas
df['category'] = df['keyword'].str.findall('|'.join(keywords)).apply(set).str.join(',')
# Rows with no match yield an empty string; label them "other"
df['category'] = df['category'].replace('', "other")
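If only the first matching keyword is needed and the text may be in mixed case (the sample data contains upper-case words), a variant using str.extract could look like this. It is a sketch under those assumptions; the example rows are made up.

```python
import re

import pandas as pd

df = pd.DataFrame({"keyword": ["Tele HEALTH practice", "yyyy therapist", "zzzzz"]})
keywords = ["health", "therapist"]

# One capture group with all keywords as alternatives; re.escape guards
# against keywords that contain regex metacharacters.
pattern = "(" + "|".join(re.escape(k) for k in keywords) + ")"

df["category"] = (
    df["keyword"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match or NaN
    .str.lower()       # normalise the matched text to the keyword's case
    .fillna("other")   # rows with no match
)
print(df)
```

Unlike findall, extract keeps only the first match per row, so a row containing several keywords gets a single category.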

CodePudding user response：

The easiest solution would be:

import pandas as pd
df = ...  # your DataFrame

Assign the default value first. You could also set it to None or np.nan.

df['category'] = 'other'
# Overwrite the default on rows whose keyword contains 'therapist'
df.loc[df.keyword.str.contains('therapist'), 'category'] = 'therapist'
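The same pattern extends to several categories by looping over a dictionary of rules. The rules dictionary below is a hypothetical example; the rows come from the sample data in the question.

```python
import pandas as pd

df = pd.DataFrame({"keyword": [
    "it support COMPANY LONDON-D-UK-G1161_bing",
    "global sales training platform openings",
    "tele practice therapist",
    "tele health practice",
]})

# Hypothetical substring -> category rules.
rules = {"sales": "sales", "therapist": "therapist", "health": "health"}

df["category"] = "other"  # default for rows that match nothing
for needle, category in rules.items():
    # Plain (non-regex), case-insensitive substring test.
    mask = df["keyword"].str.contains(needle, case=False, regex=False)
    df.loc[mask, "category"] = category
print(df)
```

Note that when a row matches more than one rule, the rule that comes last in the dictionary overwrites the earlier ones, so order the rules by priority.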