Home > Blockchain >  more efficient column filling along rows in pandas dataframe
more efficient column filling along rows in pandas dataframe

Time:05-17

I am building a matrix of 1 and -1 based on the words in an existing column. It is for a basic neural network. the data is setup as follows:

data_table                                data_table_final
index            word_list                index            word_list   cat   hat  mouse  house  run into  hills your lives
    0   the cat in the hat                    0   the cat in the hat     1     1     -1     -1   -1   -1     -1    -1   -1
    1   mouse in the house                    1   mouse in the house    -1    -1      1      1   -1   -1     -1    -1   -1
    2   run into the hills         -->        2   run into the hills    -1    -1     -1     -1    1    1      1    -1   -1
    3   run for your lives                    3   run for your lives    -1    -1     -1     -1    1   -1     -1     1    1

to generate this dataframe I perform the following:

    word_list = ' '.join(sorted(data_table['world_list']))
    stop_words = ['the','and', 'a', 'in','is', 'to', 'at','by', '']

    pat = r'\b(?:{})\b'.format('|'.join(stop_words))
    word_list = re.sub(pat,'',word_list)

    word_list = word_list.split(' ')
    word_list_2 = dict.fromkeys(word_list)

    word_df = pd.DataFrame(-np.ones([len(data_table),len(word_list_2)]), columns=word_list_2)        
    if '' in word_df.keys():
        word_df = word_df.drop(columns=(''))
        
    word_df = pd.concat([data_table,word_df],1)

    for idx, it in data_table_final.iterrows():
        for word in it.word_list.split(" "):
            data_table_final.loc[idx,word] = 1

This is rather slow when the number of entries in the table grows. And I think there should be a way to perform this without having to iterate over the rows of the dataframe. I thought about using zip but all instance I have used it point to multiple outputs as opposed to a table of multiple columns.

Is there a more efficient way to perform the task without iterating over the table?

CodePudding user response:

Not sure if it's faster but you can try using collection.Counter

from collections import Counter

df = pd.DataFrame([
    "the cat in the hat", 
    "mouse in the house",
    "run into the hills",
    "run for your lives"
])            
df2 = pd.DataFrame(
    df[0].map(
        lambda x: dict(Counter(x.split()))
    ).to_list()).fillna(-1)
df2[df2>0] = 1
print(df2)
   the  cat   in  hat  mouse  house  run  into  hills  for  your  lives
0  1.0  1.0  1.0  1.0   -1.0   -1.0 -1.0  -1.0   -1.0 -1.0  -1.0   -1.0
1  1.0 -1.0  1.0 -1.0    1.0    1.0 -1.0  -1.0   -1.0 -1.0  -1.0   -1.0
2  1.0 -1.0 -1.0 -1.0   -1.0   -1.0  1.0   1.0    1.0 -1.0  -1.0   -1.0
3 -1.0 -1.0 -1.0 -1.0   -1.0   -1.0  1.0  -1.0   -1.0  1.0   1.0    1.0

You may filter out unnecessary columns then merge with the original dataframe.

By similar way, you can use pd.unique zip. I think it could be slower as I've used map twice but there might be a faster way

df2 = pd.DataFrame(df[0].map(
    lambda x: pd.unique(x.split())
).map(
    lambda x: dict(zip(x, [1]*len(x)))
).to_list()).fillna(-1)

CodePudding user response:

Scikit-learn has a CountVectorizer class that you can use for this purpose. By default, it returns the counts for each non stop-word as integers, but you can easily transform those into your 1/-1 encoding (I added the 'cat cat cat' string for testing):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

stop_words = ['the', 'and', 'a', 'in', 'is', 'to', 'at', 'by', 'for']
data_table = pd.Series(['the cat in the hat',
                        'mouse in the house',
                        'run into the hills',
                        'run for your lives',
                        'cat cat cat'],
                       name='word_list')

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(data_table)

word_df = pd.concat([data_table, 
                     pd.DataFrame((X.toarray() > 0) * 2 - 1, 
                         columns=vectorizer.get_feature_names_out())], 
                    axis=1)
print(word_df)
            word_list  cat  hat  hills  house  into  lives  mouse  run  your
0  the cat in the hat    1    1     -1     -1    -1     -1     -1   -1    -1
1  mouse in the house   -1   -1     -1      1    -1     -1      1   -1    -1
2  run into the hills   -1   -1      1     -1     1     -1     -1    1    -1
3  run for your lives   -1   -1     -1     -1    -1      1     -1    1     1
4         cat cat cat    1   -1     -1     -1    -1     -1     -1   -1    -1

As you can see, the columns are ordered lexicographically by default.

  • Related