How to iterate efficiently over a Pandas DataFrame with numpy.vectorize?

Time:11-02

I'm trying to iterate over a Pandas DataFrame, passing each row as a parameter to a function. I tried this:

def vectorize_df(df, hg):
   print(hg + str(df['tweets_id']) + df['tokenized_text'])

df = pd.DataFrame.from_records(belongs_node, columns=['tweets_id','tokenized_text'])
vfunct = numpy.vectorize(vectorize_df)
vfunct(df, "#Python")

The problem is that when I do that, the df parameter receives the value from 'tweets_id' instead of the whole row. Thanks a lot :)

CodePudding user response:

When you define a function to be vectorized:

  • each column should be a separate parameter,
  • you should call it passing the corresponding columns,
  • "other" parameters (not taken from the source array) should be marked as "excluded" parameters.

Another detail is that a vectorized function should not print anything, but it should return some value - the result of processing parameters from the current source row.
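To illustrate why passing the whole DataFrame does not deliver rows, here is a minimal sketch. The sample data and the name show_cell are made up for the illustration. np.vectorize converts the DataFrame to a 2-D array and calls the function once per cell, so the function never sees a row:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({
    "tweets_id": [101, 102],
    "tokenized_text": ["aaa bbb", "ccc ddd"],
})

def show_cell(cell, hg):
    # 'cell' is a single value from the DataFrame, not a row.
    return f"{hg}:{cell}"

# The DataFrame is treated as a 2-D array; the scalar hg is broadcast.
out = np.vectorize(show_cell)(df, "#Python")
print(out.shape)   # one result per cell, same shape as the DataFrame
print(out[0, 0])   # built from the first tweets_id value only
```

This is also why indexing the argument with df['tweets_id'] inside the function cannot work: the argument is already a single cell.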

So you could, for example, proceed as follows:

  1. Define your function as:

    def myFunct(col1, col2, hg):
        return f'{hg} / {col1} / {col2}'
    

    Don't use the word vectorize in the name of the function. For now it is an "ordinary" function. It will be vectorized in a moment.

  2. Create the vectorized function:

    vfunct = np.vectorize(myFunct, excluded=['hg'])
    
  3. And finally call it:

    vfunct(df.tweets_id, df.tokenized_text, '#Python')
    

The result, for my sample data, is:

array(['#Python / 101 / aaa bbb ccc ddd',
       '#Python / 102 / eee fff ggg hhh iii jjj',
       '#Python / 103 / kkk lll mmm nnn ooo ppp'], dtype='<U39')

It is up to you what you do with this result. You may e.g. set it as a new column of your source DataFrame.
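As a sketch of that last step, the snippet below puts the pieces together and stores the result in a new column. The sample data is reconstructed from the output shown above, and the column name tagged is hypothetical. Here hg is passed by keyword, on the assumption that an exclusion given by name is matched against keyword arguments. A positional scalar is simply broadcast and produces the same strings.

```python
import numpy as np
import pandas as pd

# Sample data assumed to match the result shown above.
df = pd.DataFrame({
    "tweets_id": [101, 102, 103],
    "tokenized_text": [
        "aaa bbb ccc ddd",
        "eee fff ggg hhh iii jjj",
        "kkk lll mmm nnn ooo ppp",
    ],
})

def myFunct(col1, col2, hg):
    # Ordinary function: receives one value from each column per call.
    return f'{hg} / {col1} / {col2}'

# 'hg' is excluded from vectorization and passed through unchanged.
vfunct = np.vectorize(myFunct, excluded=['hg'])

# Store the per-row results in a new (hypothetically named) column.
df['tagged'] = vfunct(df.tweets_id, df.tokenized_text, hg='#Python')
print(df['tagged'].tolist())
```

Note that the NumPy documentation describes np.vectorize as a convenience whose implementation is essentially a for loop, so it should not be expected to speed things up the way true array operations do.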
