Using selected keywords from a string column to form one-hot-encoding-type columns of a pandas DataFrame


To demonstrate my question, consider the following example. Let's assume I have the following DataFrame:

index  ignore_x  ignore_y  phrases
0      43        23        cat eats mice
1      1.3       33        water is pure
2      13        63        machine learning
3      15        35        where there is a will, there is a way

Now consider that I have certain words I want to form dummy variables only for those.

keywords = ['cat', 'is']

To do that, a separate column is populated for each of the keywords:

index  x_ignore  y_ignore  phrases                                 kw_cat  kw_is
0      43        23        cat eats mice                           0       0
1      1.3       33        water is pure                           0       0
2      13        63        machine learning                        0       0
3      15        35        where there is a will, there is a way   0       0

Each phrase is scanned for the keywords, and if one is present, the corresponding column is set to True / 1. (An alternative could be to count the occurrences as well, but let's keep it simple for now.)

index  x_ignore  y_ignore  phrases                                 kw_cat  kw_is
0      43        23        cat eats mice                           1       0
1      1.3       33        water is pure                           0       1
2      13        63        machine learning                        0       0
3      15        35        where there is a will, there is a way   0       1

What have I been trying? Loosely, something like this:

for row, element in enumerate(df):
    for item in keywords:
        if item in df['phrases'].str.split(' '):
            df.loc[row, element] = 1

But this is not helping me out; it just gives me a diagonal of 1s in those dummy-variable columns.

Thanks :)

Edit: just bolded the keywords to help you go through it quickly :)

CodePudding user response:

You can use nltk.tokenize.word_tokenize() to split the sentences into word lists:

import nltk
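# note: word_tokenize may need the Punkt tokenizer data; if it is missing, run nltk.download('punkt') once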

keywords = ['cat', 'is']

tokenize = df['phrases'].apply(nltk.tokenize.word_tokenize)
print(tokenize)

0                                    [cat, eats, mice]
1                                    [water, is, pure]
2                                  [machine, learning]
3    [where, there, is, a, will, ,, there, is, a, way]

Then loop through the keywords and check whether each keyword is in the generated word list:

for keyword in keywords:
    df[f'kw_{keyword}'] = tokenize.apply(lambda lst: int(keyword in lst))
print(df)

   index  ignore_x  ignore_y                                phrases  kw_cat  kw_is
0      0      43.0        23                          cat eats mice       1      0
1      1       1.3        33                          water is pure       0      1
2      2      13.0        63                       machine learning       0      0
3      3      15.0        35  where there is a will, there is a way       0      1
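
As a side note (my own addition, not part of the answer above): if punctuation handling does not matter, a plain whitespace split avoids the nltk dependency; the trade-off is that punctuation stays attached to the tokens (e.g. "will," in row 3). A minimal sketch:

# assumption: whitespace splitting is good enough; punctuation stays glued to the tokens
tokens = df['phrases'].str.split()
for keyword in keywords:
    df[f'kw_{keyword}'] = tokens.apply(lambda lst: int(keyword in lst))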

CodePudding user response:

Here is one solution. Since the phrases are strings, convert them to lists under a new column (phrases2 in my case). explode converts the list elements to separate rows, which are then filtered based on the keywords. get_dummies converts the categorical data into columns, and finally the duplicates are dropped.

df2 = df.copy()
df2['phrases2'] = df2['phrases'].apply(lambda x: x.split(' '))
df2 = df2.explode('phrases2')
df2 = df2[df2['phrases2'].isin(keywords)]
pd.get_dummies(df2, columns=['phrases2']).drop_duplicates()
   index  ignore_x  ignore_y  phrases                                 phrases2_cat  phrases2_is
0      0      43.0        23  cat eats mice                                      1            0
1      1       1.3        33  water is pure                                      0            1
3      3      15.0        35  where there is a will, there is a way              0            1
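
Note that rows whose phrase contains none of the keywords (row 2, machine learning) drop out entirely because of the isin filter. A minimal sketch of one way to keep them, building on the df2 from the snippet above (the kw_ column names are my own choice, not from the answer):

dummies = pd.get_dummies(df2['phrases2'], prefix='kw')  # one row per matched keyword occurrence
dummies = dummies.groupby(level=0).max()                # collapse exploded rows back to one per phrase
out = df.join(dummies).fillna(0)                        # unmatched rows get 0 instead of disappearing
out[['kw_cat', 'kw_is']] = out[['kw_cat', 'kw_is']].astype(int)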

CodePudding user response:

a_cat = df['phrases'].str.find('cat') != -1
a_is = df['phrases'].str.find('is') != -1

df.loc[df[a_cat == True].index, 'kw_cat'] = 1
df.loc[df[a_is == True].index, 'kw_is'] = 1

Output

   index  x_ignore  ...  kw_cat kw_is
0      0      43.0  ...       1     0
1      1       1.3  ...       0     1
2      2      13.0  ...       0     0
3      3      15.0  ...       0     1

Below is the code for the case where there are many keywords.

keywords = ['cat', 'is']
ttt = 'kw_'
for i in keywords:
    a = df['phrases'].str.find(i)
    df.loc[df[a >= 0].index, ttt + i] = 1

Here, a search for the required substrings is used; the comparison returns True or False, and based on this the indexes used to set the values are formed.
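
One caveat not covered above: str.find matches substrings, so a keyword like 'is' would also fire inside words such as 'island' or 'this'. If whole-word matches are needed, a minimal sketch using str.contains with a word-boundary regex (my own variant, not the answer's):

import re

keywords = ['cat', 'is']
for kw in keywords:
    # \b ... \b limits the match to whole words; re.escape protects regex metacharacters
    df[f'kw_{kw}'] = df['phrases'].str.contains(rf'\b{re.escape(kw)}\b').astype(int)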
