I want to assign a category to each row based on a keyword found in a string column; if no keyword is found, the category should be "other".
For example, if "health" is found in the column, label that row "HEALTH"; if "therapist" is found, label it "THERAPIST".
- create a "category" column through code
- assign categories based on condition
I was able to do this in Excel by creating a lookup table and using INDEX/MATCH, and I want to switch to Python to apply it to a large dataset.
Below is the sample data:
keyword | category |
---|---|
HR Consultancy UK-d-uk-159_bing | other |
it support COMPANY LONDON-D-UK-G1161_bing | other |
global sales training platform openings | sales |
tele private practice therapist | therapist |
asset grant management system | other |
digital team project management solution openings | other |
global training platform openings | other |
tele practice therapist | therapist |
global sales training platform openings | sales |
tele health practice | health |
asset grant management | other |
digital team project management solution | other |
CodePudding user response:
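- pandas has an apply method for this. Write a function that takes one string and returns its category, then apply it to the column:
    def f(s):
        # str.find returns the index of the match, or -1 if there is none
        if s.find('health') >= 0:
            return 'health'
        if s.find('thera...') >= 0:
            return 'thera'
        ...
    df['category'] = df['keyword'].apply(f)
Read: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html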
- pandas also has a map method. First create a dictionary, then pass it to map:
    d = {keyword1: category1, ...}
    df['category'] = df['keyword'].map(d)
This only works when the whole cell value matches a dictionary key exactly; it does not search for a keyword inside a longer string, and values with no matching key become NaN. Read: https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html
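As a rough sketch of the apply approach above, the function can loop over a list of keywords instead of repeating one if per keyword. The keyword list, the function name categorize, and the lowercase matching are assumptions for illustration; the rows are taken from the question's sample data.

```python
import pandas as pd

# Hypothetical keyword list; its order decides which keyword wins
# when a row contains more than one of them.
KEYWORDS = ["health", "therapist", "sales"]

def categorize(text: str) -> str:
    """Return the first keyword found in text (case-insensitive), else 'other'."""
    lowered = text.lower()
    for kw in KEYWORDS:
        if kw in lowered:
            return kw
    return "other"

df = pd.DataFrame({"keyword": [
    "HR Consultancy UK-d-uk-159_bing",
    "global sales training platform openings",
    "tele private practice therapist",
    "tele health practice",
]})

# Series.apply calls categorize once per cell of the keyword column.
df["category"] = df["keyword"].apply(categorize)
print(df)
```

This runs a Python function per row, so on a very large dataset a vectorized string method may be faster.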
CodePudding user response:
You can first create a list of keywords to search for, extract any that appear in each row, and then fill the rows with no match with "other".
import pandas as pd
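# example df
df = pd.DataFrame(data=['health xxxxx', 'yyyy therapist', 'zzzzz'], columns=["keyword"])
keywords = ['health', 'therapist']
# find every keyword in each row, drop duplicates with set, and join the matches with commas
df['category'] = df['keyword'].str.findall('|'.join(keywords)).apply(set).str.join(',')
# rows with no match end up as an empty string; replace those with "other"
df['category'] = df['category'].str.strip().replace('', "other")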
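If each row should get exactly one category, a variant of the same idea is to take only the first match with str.extract. This is a sketch under assumptions: the keyword list and the name df_extract are made up for illustration, and matching is made case-insensitive, which the answer above does not do.

```python
import re
import pandas as pd

keywords = ["health", "therapist", "sales"]  # assumed keyword list

df_extract = pd.DataFrame({"keyword": [
    "tele private practice therapist",
    "Tele HEALTH practice",
    "asset grant management system",
]})

# One capture group with all keywords as alternatives; re.escape guards
# against keywords that contain regex metacharacters.
pattern = "(" + "|".join(map(re.escape, keywords)) + ")"

df_extract["category"] = (
    df_extract["keyword"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match or NaN
    .str.lower()        # normalize the matched text to the keyword's case
    .fillna("other")    # rows without any keyword
)
print(df_extract)
```

Here the match that appears earliest in the text wins, not the keyword that comes first in the list.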
CodePudding user response:
The easiest solution would be:
import pandas as pd
# df is your DataFrame
# assign the default value first; you could set it to None or np.nan too
df['category'] = 'other'
# overwrite the default in rows whose keyword column contains 'therapist'
df.loc[df['keyword'].str.contains('therapist'), 'category'] = 'therapist'
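To cover several keywords, the same default-then-overwrite pattern can be put in a loop. The following is a minimal sketch, assuming a hypothetical rules dictionary that maps each search term to its category label; the rows come from the question's sample data.

```python
import pandas as pd

df_loc = pd.DataFrame({"keyword": [
    "global sales training platform openings",
    "tele practice therapist",
    "tele health practice",
    "asset grant management",
]})

# Hypothetical mapping from search term to category label.
rules = {"health": "health", "therapist": "therapist", "sales": "sales"}

# Start with the default, then overwrite rows that contain each term.
df_loc["category"] = "other"
for term, label in rules.items():
    # regex=False treats the term as plain text; case=False ignores case
    mask = df_loc["keyword"].str.contains(term, case=False, regex=False)
    df_loc.loc[mask, "category"] = label
print(df_loc)
```

If a row contains more than one term, the term that appears later in rules overwrites the earlier one, so order the dictionary by priority.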