I am building a matrix of 1s and -1s based on the words in an existing column. It is for a basic neural network. The data is set up as follows (data_table is the input, data_table_final the desired result):
data_table:

   index  word_list
   0      the cat in the hat
   1      mouse in the house
   2      run into the hills
   3      run for your lives

data_table_final:

   index  word_list           cat  hat  mouse  house  run  into  hills  your  lives
   0      the cat in the hat    1    1     -1     -1   -1    -1     -1    -1     -1
   1      mouse in the house   -1   -1      1      1   -1    -1     -1    -1     -1
   2      run into the hills   -1   -1     -1     -1    1     1      1    -1     -1
   3      run for your lives   -1   -1     -1     -1    1    -1     -1     1      1
To generate this dataframe I perform the following:
import re
import numpy as np
import pandas as pd

# Collect every phrase into one string, then strip the stop words
word_list = ' '.join(sorted(data_table['word_list']))
stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', '']
pat = r'\b(?:{})\b'.format('|'.join(stop_words))
word_list = re.sub(pat, '', word_list)
word_list = word_list.split(' ')
word_list_2 = dict.fromkeys(word_list)  # de-duplicate, keeping order

# One column per unique word, initialised to -1
word_df = pd.DataFrame(-np.ones([len(data_table), len(word_list_2)]),
                       columns=word_list_2)
if '' in word_df.keys():
    word_df = word_df.drop(columns='')
data_table_final = pd.concat([data_table, word_df], axis=1)

# Flip the entry to 1 for every non-stop word in the row's phrase
for idx, it in data_table_final.iterrows():
    for word in it.word_list.split(" "):
        if word in word_df.columns:  # skip the stop words removed above
            data_table_final.loc[idx, word] = 1
This is rather slow as the number of entries in the table grows, and I think there should be a way to do it without iterating over the rows of the dataframe. I thought about using zip, but every instance I have seen uses it to produce multiple outputs rather than a table of multiple columns.
Is there a more efficient way to perform this task without iterating over the table?
CodePudding user response:
Not sure if it's faster, but you can try using collections.Counter:
from collections import Counter
import pandas as pd

df = pd.DataFrame([
    "the cat in the hat",
    "mouse in the house",
    "run into the hills",
    "run for your lives"
])

# One word-count dict per row; expanding the dicts into columns leaves
# NaN where a word is absent, which fillna turns into -1
df2 = pd.DataFrame(
    df[0].map(lambda x: dict(Counter(x.split()))).to_list()
).fillna(-1)
df2[df2 > 0] = 1  # collapse any positive count to 1
print(df2)
   the  cat   in  hat  mouse  house  run  into  hills  for  your  lives
0  1.0  1.0  1.0  1.0   -1.0   -1.0 -1.0  -1.0   -1.0 -1.0  -1.0   -1.0
1  1.0 -1.0  1.0 -1.0    1.0    1.0 -1.0  -1.0   -1.0 -1.0  -1.0   -1.0
2  1.0 -1.0 -1.0 -1.0   -1.0   -1.0  1.0   1.0    1.0 -1.0  -1.0   -1.0
3 -1.0 -1.0 -1.0 -1.0   -1.0   -1.0  1.0  -1.0   -1.0  1.0   1.0    1.0
You may filter out the unnecessary columns and then merge with the original dataframe.
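For instance, a minimal sketch that reuses the question's stop_words list on the df and df2 built above (result is a name I've introduced):

stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', 'for']
# Keep only the non-stop-word columns, then glue them back onto the phrases
df2 = df2.drop(columns=[c for c in df2.columns if c in stop_words])
result = pd.concat([df.rename(columns={0: 'word_list'}), df2], axis=1)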
In a similar way, you can use pd.unique with zip. I think it could be slower since map is used twice, but there might be a faster way:
df2 = pd.DataFrame(
    df[0].map(lambda x: pd.unique(x.split()))
         .map(lambda x: dict(zip(x, [1] * len(x))))
         .to_list()
).fillna(-1)
CodePudding user response:
Scikit-learn has a CountVectorizer class that you can use for this purpose. By default, it returns the count of each non-stop word as an integer, but you can easily transform those counts into your 1/-1 encoding (I added the 'cat cat cat' string for testing):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', 'for']
data_table = pd.Series(['the cat in the hat',
                        'mouse in the house',
                        'run into the hills',
                        'run for your lives',
                        'cat cat cat'],
                       name='word_list')

# Sparse matrix of raw word counts, one row per phrase
vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(data_table)

# (X.toarray() > 0) is a boolean mask; * 2 - 1 maps True/False to 1/-1
word_df = pd.concat([data_table,
                     pd.DataFrame((X.toarray() > 0) * 2 - 1,
                                  columns=vectorizer.get_feature_names_out())],
                    axis=1)
print(word_df)
            word_list  cat  hat  hills  house  into  lives  mouse  run  your
0  the cat in the hat    1    1     -1     -1    -1     -1     -1   -1    -1
1  mouse in the house   -1   -1     -1      1    -1     -1      1   -1    -1
2  run into the hills   -1   -1      1     -1     1     -1     -1    1    -1
3  run for your lives   -1   -1     -1     -1    -1      1     -1    1     1
4         cat cat cat    1   -1     -1     -1    -1     -1     -1   -1    -1
As you can see, the columns are ordered lexicographically by default.
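If you want the columns in a different order (say, the order they appear in your desired output), you can reorder them yourself. A minimal sketch, assuming the word_df built above (custom_order is a hypothetical ordering):

# Hypothetical order; any permutation of the learned vocabulary works
custom_order = ['cat', 'hat', 'mouse', 'house', 'run', 'into', 'hills', 'your', 'lives']
word_df = word_df[['word_list'] + custom_order]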