I am building a matrix of 1s and -1s based on the words in an existing column. It is for a basic neural network. The data is set up as follows (data_table is the input, data_table_final the desired result):
data_table:

   index  word_list
   0      the cat in the hat
   1      mouse in the house
   2      run into the hills
   3      run for your lives

data_table_final:

   index  word_list           cat  hat  mouse  house  run  into  hills  your  lives
   0      the cat in the hat    1    1     -1     -1   -1    -1     -1    -1     -1
   1      mouse in the house   -1   -1      1      1   -1    -1     -1    -1     -1
   2      run into the hills   -1   -1     -1     -1    1     1      1    -1     -1
   3      run for your lives   -1   -1     -1     -1    1    -1     -1     1      1
To generate this dataframe I perform the following:
import re
import numpy as np
import pandas as pd

# Collect every phrase into one string, then strip the stop words
word_list = ' '.join(sorted(data_table['word_list']))
stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', '']
pat = r'\b(?:{})\b'.format('|'.join(stop_words))
word_list = re.sub(pat, '', word_list)
word_list = word_list.split(' ')
word_list_2 = dict.fromkeys(word_list)  # de-duplicate, keeping order

# One column per unique word, initialised to -1
word_df = pd.DataFrame(-np.ones([len(data_table), len(word_list_2)]),
                       columns=word_list_2)
if '' in word_df.keys():
    word_df = word_df.drop(columns='')
data_table_final = pd.concat([data_table, word_df], axis=1)

# Flip the entry to 1 for every non-stop word in the row's phrase
for idx, it in data_table_final.iterrows():
    for word in it.word_list.split(" "):
        if word in word_df.columns:  # skip the stop words removed above
            data_table_final.loc[idx, word] = 1
This is rather slow as the number of entries in the table grows, and I think there should be a way to do it without iterating over the rows of the dataframe. I thought about using zip, but every instance I have seen uses it to produce multiple outputs rather than a table of multiple columns.
Is there a more efficient way to perform this task without iterating over the table?
CodePudding user response:
Not sure if it's faster, but you can try using collections.Counter:
from collections import Counter
import pandas as pd

df = pd.DataFrame([
    "the cat in the hat",
    "mouse in the house",
    "run into the hills",
    "run for your lives"
])

# One word-count dict per row; expanding the dicts into columns leaves
# NaN where a word is absent, which fillna turns into -1
df2 = pd.DataFrame(
    df[0].map(lambda x: dict(Counter(x.split()))).to_list()
).fillna(-1)
df2[df2 > 0] = 1  # collapse any positive count to 1
print(df2)
   the  cat   in  hat  mouse  house  run  into  hills  for  your  lives
0  1.0  1.0  1.0  1.0   -1.0   -1.0 -1.0  -1.0   -1.0 -1.0  -1.0   -1.0
1  1.0 -1.0  1.0 -1.0    1.0    1.0 -1.0  -1.0   -1.0 -1.0  -1.0   -1.0
2  1.0 -1.0 -1.0 -1.0   -1.0   -1.0  1.0   1.0    1.0 -1.0  -1.0   -1.0
3 -1.0 -1.0 -1.0 -1.0   -1.0   -1.0  1.0  -1.0   -1.0  1.0   1.0    1.0
You may filter out the unnecessary columns and then merge with the original dataframe.
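For instance, a minimal sketch that reuses the question's stop_words list on the df and df2 built above (result is a name I've introduced):

stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', 'for']
# Keep only the non-stop-word columns, then glue them back onto the phrases
df2 = df2.drop(columns=[c for c in df2.columns if c in stop_words])
result = pd.concat([df.rename(columns={0: 'word_list'}), df2], axis=1)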
In a similar way, you can use pd.unique with zip. I think it could be slower since map is used twice, but there might be a faster way:
df2 = pd.DataFrame(
    df[0].map(lambda x: pd.unique(x.split()))
         .map(lambda x: dict(zip(x, [1] * len(x))))
         .to_list()
).fillna(-1)
CodePudding user response:
Scikit-learn has a CountVectorizer class that you can use for this purpose. By default, it returns the count of each non-stop word as an integer, but you can easily transform those counts into your 1/-1 encoding (I added the 'cat cat cat' string for testing):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', 'for']
data_table = pd.Series(['the cat in the hat',
                        'mouse in the house',
                        'run into the hills',
                        'run for your lives',
                        'cat cat cat'],
                       name='word_list')

# Sparse matrix of raw word counts, one row per phrase
vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(data_table)

# (X.toarray() > 0) is a boolean mask; * 2 - 1 maps True/False to 1/-1
word_df = pd.concat([data_table,
                     pd.DataFrame((X.toarray() > 0) * 2 - 1,
                                  columns=vectorizer.get_feature_names_out())],
                    axis=1)
print(word_df)
            word_list  cat  hat  hills  house  into  lives  mouse  run  your
0  the cat in the hat    1    1     -1     -1    -1     -1     -1   -1    -1
1  mouse in the house   -1   -1     -1      1    -1     -1      1   -1    -1
2  run into the hills   -1   -1      1     -1     1     -1     -1    1    -1
3  run for your lives   -1   -1     -1     -1    -1      1     -1    1     1
4         cat cat cat    1   -1     -1     -1    -1     -1     -1   -1    -1
As you can see, the columns are ordered lexicographically by default.
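If you want the columns in a different order (say, the order they appear in your desired output), you can reorder them yourself. A minimal sketch, assuming the word_df built above (custom_order is a hypothetical ordering):

# Hypothetical order; any permutation of the learned vocabulary works
custom_order = ['cat', 'hat', 'mouse', 'house', 'run', 'into', 'hills', 'your', 'lives']
word_df = word_df[['word_list'] + custom_order]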