To demonstrate my question, consider the following example. Let's assume I have the following dataframe:
index | ignore_x | ignore_y | phrases |
---|---|---|---|
0 | 43 | 23 | cat eats mice |
1 | 1.3 | 33 | water is pure |
2 | 13 | 63 | machine learning |
3 | 15 | 35 | where there is a will, there is a way |
Now suppose I have certain keywords, and I want to form dummy variables only for those:
keywords = ['cat', 'is']
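For anyone who wants to reproduce this, the example frame can be rebuilt from the tables above (a minimal sketch; the `ignore_x`/`ignore_y` values are just the ones shown):

```python
import pandas as pd

# Rebuild the example dataframe from the tables above
df = pd.DataFrame({
    "ignore_x": [43, 1.3, 13, 15],
    "ignore_y": [23, 33, 63, 35],
    "phrases": [
        "cat eats mice",
        "water is pure",
        "machine learning",
        "where there is a will, there is a way",
    ],
})
```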
To do that, a separate column is created for each keyword:
index | ignore_x | ignore_y | phrases | kw_cat | kw_is |
---|---|---|---|---|---|
0 | 43 | 23 | cat eats mice | 0 | 0 |
1 | 1.3 | 33 | water is pure | 0 | 0 |
2 | 13 | 63 | machine learning | 0 | 0 |
3 | 15 | 35 | where there is a will, there is a way | 0 | 0 |
Each phrase is scanned for the keywords, and if one is present, the corresponding column gets True / 1. (An alternative would be to count the occurrences as well, but let's keep it simple for now.)
index | ignore_x | ignore_y | phrases | kw_cat | kw_is |
---|---|---|---|---|---|
0 | 43 | 23 | cat eats mice | 1 | 0 |
1 | 1.3 | 33 | water is pure | 0 | 1 |
2 | 13 | 63 | machine learning | 0 | 0 |
3 | 15 | 35 | where there is a will, there is a way | 0 | 1 |
What have I been trying? Loosely, something like this:
for row, element in enumerate(df):
    for item in keywords:
        if item in df['phrases'].str.split(' '):
            df.loc[row, element] = 1
But this is not helping me out; it just puts a diagonal of 1s in those dummy columns.
Thanks :)
Edit: Just bolded the keywords to help you go through quickly :)
CodePudding user response:
You can use nltk.tokenize.word_tokenize()
to split the sentences into word lists.
import nltk  # the tokenizer models may need a one-time nltk.download('punkt')
keywords = ['cat', 'is']
tokenize = df['phrases'].apply(nltk.tokenize.word_tokenize)
print(tokenize)
0 [cat, eats, mice]
1 [water, is, pure]
2 [machine, learning]
3 [where, there, is, a, will, ,, there, is, a, way]
Then loop through the keywords
and check whether each keyword is in the generated word list.
for keyword in keywords:
    df[f'kw_{keyword}'] = tokenize.apply(lambda lst: int(keyword in lst))
print(df)
index ignore_x ignore_y phrases kw_cat kw_is
0 0 43.0 23 cat eats mice 1 0
1 1 1.3 33 water is pure 0 1
2 2 13.0 63 machine learning 0 0
3 3 15.0 35 where there is a will, there is a way 0 1
CodePudding user response:
Here is one solution to it. Since the phrases are strings, split them into lists under a new column (phrases2 in my case). explode converts the list elements into separate rows, which are then filtered against the keywords. get_dummies converts the categorical column into indicator columns, and finally the duplicates are dropped. Note that rows whose phrase contains no keyword at all (row 2 here) are filtered out entirely.
df2 = df.copy()  # copy, otherwise df itself gets modified
df2['phrases2'] = df2['phrases'].str.split(' ')
df2 = df2.explode('phrases2')
df2 = df2[df2['phrases2'].isin(keywords)]
pd.get_dummies(df2, columns=['phrases2']).drop_duplicates()
index ignore_x ignore_y phrases phrases2_cat phrases2_is
0 0 43 23 cat eats mice 1 0
1 1 1.3 33 water is pure 0 1
3 3 15 35 where there is a will, there is a way 0 1
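If the kw_-prefixed column names from the question are wanted, get_dummies accepts a prefix argument, and dtype=int keeps 0/1 instead of True/False. A self-contained sketch of the same explode route (the dataframe is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Rebuild the example data so the snippet is self-contained
df = pd.DataFrame({"phrases": [
    "cat eats mice",
    "water is pure",
    "machine learning",
    "where there is a will, there is a way",
]})
keywords = ["cat", "is"]

df2 = df.copy()
df2["phrases2"] = df2["phrases"].str.split(" ")
df2 = df2.explode("phrases2")
df2 = df2[df2["phrases2"].isin(keywords)]
# prefix='kw' yields the kw_cat / kw_is names from the question
out = pd.get_dummies(df2, columns=["phrases2"], prefix="kw", dtype=int).drop_duplicates()
```

As above, rows without any keyword (row 2) do not survive the isin filter.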
CodePudding user response:
a_cat = df['phrases'].str.find('cat') != -1
a_is = df['phrases'].str.find('is') != -1
df['kw_cat'] = 0  # initialise so non-matching rows get 0, not NaN
df['kw_is'] = 0
df.loc[a_cat, 'kw_cat'] = 1
df.loc[a_is, 'kw_is'] = 1
Output
index ignore_x ... kw_cat kw_is
0 0 43.0 ... 1 0
1 1 1.3 ... 0 1
2 2 13.0 ... 0 0
3 3 15.0 ... 0 1
Below is a generalised version for when there are many keywords.
keywords = ['cat', 'is']
ttt = 'kw_'
for i in keywords:
    a = df['phrases'].str.find(i)
    df[ttt + i] = 0  # initialise so non-matching rows get 0, not NaN
    df.loc[a >= 0, ttt + i] = 1
Here, str.find searches each phrase for the keyword and returns its position (or -1); that is turned into a boolean mask, which selects the rows where the value is set. Note that this matches substrings rather than whole words.
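Because str.find matches substrings, a keyword like 'is' would also flag a phrase containing 'fish' or 'island'. A whole-word variant can be sketched with str.contains and a regex word boundary (re.escape guards against keywords containing regex metacharacters):

```python
import re
import pandas as pd

# Rebuild the example data so the snippet is self-contained
df = pd.DataFrame({"phrases": [
    "cat eats mice",
    "water is pure",
    "machine learning",
    "where there is a will, there is a way",
]})
keywords = ["cat", "is"]

for kw in keywords:
    # \b restricts the match to whole words, so "is" won't match "fish"
    pattern = rf"\b{re.escape(kw)}\b"
    df[f"kw_{kw}"] = df["phrases"].str.contains(pattern).astype(int)
```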