Looking-up part of text in a column and if found true pass a Creating a new column by reflecting ass

Time:07-18

I want to assign a category to each row based on whether a particular keyword appears in its text. If no keyword is found, the category should be "other".

For example, if "health" is found in a column, label that row "HEALTH"; if "therapist" is found, label it "THERAPIST".

  1. create a "category" column through code
  2. assign categories based on condition

I was able to do this in Excel by creating a lookup table and using INDEX/MATCH, and I want to switch to Python to apply it to a large dataset.

Below is the sample data:

keyword | category
HR Consultancy UK-d-uk-159_bing | other
it support COMPANY LONDON-D-UK-G1161_bing | other
global sales training platform openings | sales
tele private practice therapist | therapist
asset grant management system | other
digital team project management solution openings | other
global training platform openings | other
tele practice therapist | therapist
global sales training platform openings | sales
tele health practice | health
asset grant management | other
digital team project management solution | other

CodePudding user response:

  1. Pandas has apply methods for this:
def f(s):
    # str.find returns the index of the match, or -1 if the substring is absent
    if s.find('health') >= 0:
        return 'health'
    if s.find('thera...') >= 0:
        return 'thera'
    ...

df['category'] = df['keyword'].apply(f)

Read: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
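
A minimal runnable sketch of the apply approach above, for anyone who wants to try it directly. The three sample rows come from the question; the `KEYWORDS` list and the `categorize` helper are illustrative names, and lower-casing the text before the test is an assumption about how matching should behave.

```python
import pandas as pd

# Sample rows taken from the question's data
df = pd.DataFrame({"keyword": [
    "tele health practice",
    "tele private practice therapist",
    "asset grant management system",
]})

# Hypothetical keyword list; the order decides which one wins on multiple hits
KEYWORDS = ["health", "therapist"]

def categorize(text):
    """Return the first keyword contained in text, else 'other'."""
    lowered = text.lower()
    for kw in KEYWORDS:
        if kw in lowered:
            return kw
    return "other"

df["category"] = df["keyword"].apply(categorize)
print(df)
```

Keeping the keywords in a list means new categories can be added without touching the function body.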

  2. pandas also has a map method for this. First create a dictionary, then pass it to map.
d = {keyword1: category1, ...}
df['category'] = df['keyword'].map(d)

map looks up the whole cell value in the dictionary, so this only works when each cell exactly equals one of the keys; it does not search for substrings. Read: https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html
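
Since the question's cells contain longer phrases rather than bare keywords, one possible way to still use a dictionary is to pull the keyword out of the text first and map it afterwards. The sketch below assumes a hypothetical dictionary `d` and uses `Series.str.extract` for the first step; it is one option, not the only one.

```python
import re
import pandas as pd

# Sample rows taken from the question's data
df = pd.DataFrame({"keyword": [
    "tele health practice",
    "tele practice therapist",
    "global sales training platform openings",
    "asset grant management",
]})

# Hypothetical keyword -> category dictionary
d = {"health": "HEALTH", "therapist": "THERAPIST", "sales": "SALES"}

# A direct map compares the whole cell with the keys, so nothing matches here
direct = df["keyword"].map(d)

# Extract the leftmost dictionary key found in each cell, then map that key
pattern = "(" + "|".join(re.escape(k) for k in d) + ")"
found = df["keyword"].str.extract(pattern, expand=False)
df["category"] = found.map(d).fillna("other")
print(df)
```

Rows where no key is found come back as missing values from both `extract` and `map`, which is why the final `fillna("other")` supplies the default.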

CodePudding user response:

You can first create a list of the keywords to search for, then fill the rows where nothing was found with "other".

import pandas as pd

#example df
df=pd.DataFrame(data=['health xxxxx','yyyy therapist','zzzzz'],columns=["keyword"])

keywords=['health','therapist']


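# collect every keyword found in each row and join the distinct matches with commas
df['category'] = df['keyword'].str.findall('|'.join(keywords)).apply(set).str.join(',')
# rows with no match yield an empty string; label only those as "other"
df['category'] = df['category'].replace('', 'other')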
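
When a row can contain several keywords, or text in mixed case, the order of a joined set is not guaranteed. A hedged variant of the findall idea that lower-cases the text and sorts the distinct matches might look like this; the first two sample rows are made up for illustration.

```python
import re
import pandas as pd

# Made-up rows: mixed case, a repeated keyword, and no keyword at all
df = pd.DataFrame({"keyword": [
    "Tele HEALTH practice therapist",
    "health checks for health teams",
    "asset grant management",
]})

keywords = ["health", "therapist"]
# Escape the keywords so regex metacharacters in them are matched literally
pattern = "|".join(re.escape(k) for k in keywords)

# findall returns a list of every match per row (an empty list if none)
matches = df["keyword"].str.lower().str.findall(pattern)

# Deduplicate, sort for a stable order, and fall back to "other" when empty
df["category"] = matches.apply(lambda m: ",".join(sorted(set(m))) or "other")
print(df)
```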

CodePudding user response:

The easiest solution would be to assign the default value first (you could use None or np.nan instead) and then overwrite the rows that contain the keyword:

import pandas as pd

# df is your existing DataFrame with a 'keyword' column

df['category'] = 'other'
df.loc[df['keyword'].str.contains('therapist'), 'category'] = 'therapist'
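
To cover several keywords with the same default-then-overwrite idea, one option is to build a boolean mask per keyword and let `numpy.select` pick the first one that matches. This swaps `np.select` in for the repeated `.loc` assignments; the keyword list, the mixed-case row, and the missing-value row are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"keyword": [
    "tele health practice",
    "Tele Practice Therapist",
    "global sales training platform openings",
    None,  # a missing value, to show how na=False handles it
]})

# Hypothetical keyword list; earlier entries take priority on multiple hits
keywords = ["health", "therapist", "sales"]

# One boolean mask per keyword: literal, case-insensitive, missing -> False
conditions = [
    df["keyword"].str.contains(kw, case=False, na=False, regex=False)
    for kw in keywords
]

# The first matching condition decides the category; otherwise the default
df["category"] = np.select(conditions, keywords, default="other")
print(df)
```

Passing `regex=False` treats each keyword as plain text, which avoids surprises if a keyword ever contains characters such as `.` or `+`.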