Pandas: find text in a row and assign a dummy variable value based on it

Time:04-18

I have a data frame that contains a text column, df["input"].

I would like to create one dummy variable per keyword list. Each dummy checks whether df["input"] contains any of the words in the given list, and is set to 1 only if no earlier list has already matched that row. The logic is: 1) create a dummy variable equal to zero; 2) set it to one if the row contains any word in the given list and did not match any of the previous lists.

import pandas as pd

# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle",  "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle', 
                             'medical', 
                             'film biotechnology dentist']})

which looks like:

input
amazon listing subtitle
medical 
film biotechnology dentist

final dataset should look like:

input                           listings  scripting  medical
amazon listing subtitle            1         0         0
medical                            0         0         1          
film biotechnology dentist         0         1         0

CodePudding user response:

One possible implementation is to use str.contains in a loop to create the three boolean columns, then use argmax to find the position of the first matching list in each row, and finally build the dummy variables from those positions (rows with no match stay all zeros):

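import numpy as np
d = {'listings': listings, 'scripting': scripting, 'medical': medical}
for k, v in d.items():
    # True where the row contains any keyword from the list (regex alternation)
    df[k] = df['input'].str.contains('|'.join(v))

arr = df[list(d)].to_numpy()
tmp = np.zeros(arr.shape, dtype='int8')
# put a 1 at the first matching list of each row; rows with no match stay all 0
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
# combine_first brings the 'input' column back from df
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)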
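The argmax/max step is the least obvious part, so here is a minimal sketch of it in isolation. The boolean matrix is typed in by hand to mirror what str.contains would give for the three sample rows; the fourth row (no match at all) is a hypothetical addition to show that such rows stay all zeros.

```python
import numpy as np

# Hand-written match matrix; columns are listings, scripting, medical.
arr = np.array([
    [True,  True,  False],   # 'amazon listing subtitle'
    [False, False, True],    # 'medical'
    [False, True,  True],    # 'film biotechnology dentist'
    [False, False, False],   # hypothetical row that matches no list
])

first_hit = arr.argmax(axis=1)  # position of the first True in each row (0 if none)
has_hit = arr.max(axis=1)       # True if the row matched any list

one_hot = np.zeros(arr.shape, dtype='int8')
# Write 1 at the first matching column; a row with no match gets 0 written at column 0.
one_hot[np.arange(len(arr)), first_hit] = has_hit

print(one_hot)
```

Because argmax returns 0 for an all-False row, the value written has to come from max rather than a constant 1; otherwise unmatched rows would be flagged as listings.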

But in this case, it might be more efficient to use a nested for-loop:

import re
def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match 
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out

lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))

Output:

                        input listings medical scripting
0     amazon listing subtitle        1       0         0
1                     medical        0       1         0
2  film biotechnology dentist        0       0         1
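For checking the "first matching list wins" rule without a data frame, the following is a small standard-library sketch. The helper name first_match_flags is hypothetical, and the use of re.escape is an added assumption (it keeps any regex metacharacters in a keyword literal); the word-boundary matching follows the nested-loop answer above.

```python
import re

listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]

def first_match_flags(text, lsts):
    """Return one 0/1 flag per list, with 1 only for the first list that matches."""
    flags = [0] * len(lsts)
    for i, lst in enumerate(lsts):
        # One alternation per list, matched on whole words only
        pattern = r"\b(?:" + "|".join(re.escape(x) for x in lst) + r")\b"
        if re.search(pattern, text):
            flags[i] = 1
            break  # later lists are ignored once one has matched
    return flags

rows = ['amazon listing subtitle', 'medical', 'film biotechnology dentist']
result = [first_match_flags(r, [listings, scripting, medical]) for r in rows]
print(result)
```

Note that with word boundaries a string such as 'filming starts' does not match the keyword "film", whereas the plain str.contains('|'.join(v)) approach would match it as a substring.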