I'm trying to find a more optimized way to add data to a pandas dataframe. I already saw other related questions where people suggested first creating lists and then adding the data to pandas (which I have now implemented).
In my current setup I loop through different lists (in the example these are librarynr, books and sections) and then compute various variables (in the example they are not computed but already set: nrofletters, excitment and review), which I append to lists and finally add to the dataframe.
Does anyone know of further optimizations to improve performance of this example code?
Important note: in my final code, the variables are not the same for all rows, but are computed depending on the loop iterators (see the example calculation of excitment).
Example code:
import pandas as pd
import time

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = list(range(30000))
nrofletters = 3000
excitment = True
review = 'positive'

start_time = time.time()
summarydf = pd.DataFrame()
indexlist = []
nrofletterlist = []
excitmentlist = []
reviewlist = []
for library in librarynr:
    for book in books:
        for section in sections:
            indexlist.append(str(library) + book + section)
            nrofletterlist.append(nrofletters)
            # example of a variable computed from the loop iterators:
            if (library % 2 == 0) or (book[1] == 'L'):
                excitment = False
            else:
                excitment = True
            excitmentlist.append(excitment)
            reviewlist.append(review)
summarydf['index'] = indexlist
summarydf['nrofletters'] = nrofletterlist
summarydf['excitment'] = excitmentlist
summarydf['review'] = reviewlist
listtime = time.time() - start_time
print(listtime)
CodePudding user response:
Append is very slow; you should produce your DataFrame in one shot.
IIUC, you want the product of all possibilities, which you can build with itertools.product.
Here is an example with only librarynr = 5. Your setting with librarynr = 30000 would produce 480,000 rows.
import pandas as pd
from itertools import product

librarynr = 5
idx = map(''.join, product(map(str, range(librarynr)), books, sections))
df = pd.DataFrame([], index=idx)
df[['nrofletters', 'excitment', 'review']] = [3000, True, 'positive']
Output:
>>> print(df.reset_index())
index nrofletters excitment review
0 0LordOfTheRingsIntroduction 3000 True positive
1 0LordOfTheRingsMainPart 3000 True positive
2 0LordOfTheRingsPlottwist 3000 True positive
3 0LordOfTheRingsSurprisingEnd 3000 True positive
4 0HarryPotterIntroduction 3000 True positive
.. ... ... ... ...
75 4LoveStorySurprisingEnd 3000 True positive
76 4RandomBookIntroduction 3000 True positive
77 4RandomBookMainPart 3000 True positive
78 4RandomBookPlottwist 3000 True positive
79 4RandomBookSurprisingEnd 3000 True positive
[80 rows x 4 columns]
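Since the question notes that excitment depends on the loop iterators, that part can be vectorized as well. A minimal sketch, assuming the books and sections lists from the question and librarynr = 5 (the repeat/tile pattern here reproduces the nested-loop order: library varies slowest, then book, then section):

```python
import numpy as np
import pandas as pd
from itertools import product

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = 5

idx = map(''.join, product(map(str, range(librarynr)), books, sections))
df = pd.DataFrame([], index=idx)
df[['nrofletters', 'review']] = [3000, 'positive']

# Rebuild the loop iterators as arrays aligned with the product order
libs = np.repeat(np.arange(librarynr), len(books) * len(sections))
book_has_L = np.array([b[1] == 'L' for b in books])  # per-book flag
bflags = np.tile(np.repeat(book_has_L, len(sections)), librarynr)

# Vectorized form of: excitment = not (library % 2 == 0 or book[1] == 'L')
df['excitment'] = ~((libs % 2 == 0) | bflags)
```

This computes the whole boolean column in a handful of array operations instead of one Python-level branch per row.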
CodePudding user response:
CPython loops are very slow due to the interpreter (note that there are faster Python interpreters, like PyPy, which use a JIT compiler). You can use list comprehensions to speed up the loops significantly. Using itertools can help a bit more, as can converting library to a string in the outermost loop. However, the result is still not very fast for the operation performed.
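As a sketch of that suggestion: hoist the str() conversion out of the inner loops and let itertools.product drive a single comprehension (reusing the lists from the question):

```python
from itertools import product

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = 30000

# str() is called once per library instead of once per row
libs = [str(i) for i in range(librarynr)]
indexlist = [lib + book + section
             for lib, book, section in product(libs, books, sections)]
```

The comprehension avoids the repeated attribute lookup and function-call overhead of list.append inside nested loops.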
Another issue comes from the conversion of Python lists to Numpy arrays. Indeed, Pandas uses Numpy internally, and Numpy converts each string reference to a fixed-width string (so it uses one big raw memory buffer rather than an array of reference-counted objects). This means that each string is parsed and copied by Numpy, which is very expensive. The best solution is to write the Numpy array directly using vectorized functions, if possible. If this is not possible, you can use Numba; however, note that Numba barely supports string arrays so far. Another possible solution is to use Cython. A direct assignment in Pandas can also be very fast for setting all the strings at once (since each string is then only parsed once by Numpy internally).
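One way to build the string index directly in Numpy is np.char.add, which concatenates string arrays elementwise. A sketch with librarynr = 5 (note that np.char.add allocates intermediate fixed-width string arrays, so this trades memory for avoiding the per-row Python work):

```python
import numpy as np

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = 5

libs = np.arange(librarynr).astype(str)  # vectorized int -> str conversion
# repeat/tile reproduce the nested-loop order: library, book, section
part = np.char.add(np.repeat(libs, len(books) * len(sections)),
                   np.tile(np.repeat(books, len(sections)), librarynr))
idx = np.char.add(part, np.tile(sections, librarynr * len(books)))
```

The resulting array can be passed straight to pd.DataFrame as the index, with no per-string Python object in between.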
On my machine, about 2/3 of the time is spent in the loop and 1/3 in the Numpy string conversion (a minor part comes from some additional Pandas overhead).