I'm trying to find a more optimized way to add data to a pandas dataframe. I already saw other related questions where people suggested first creating lists and then adding the data to pandas (which I have now implemented).
In my current setup I loop through different lists (in the example these are librarynr, books and sections) and then compute various variables (in the example they are not computed but already set: nrofletters, excitment and review), which I append to lists and finally add to the dataframe.
Does anyone know of further optimizations to improve performance of this example code?
Important note: in my final code, the variables are not the same for all rows, but are computed depending on the loop iterators (see the example calculation of excitment).
Example code:
import pandas as pd
import time

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = list(range(30000))
nrofletters = 3000
excitment = True
review = 'positive'

start_time = time.time()
summarydf = pd.DataFrame()
indexlist = []
nrofletterlist = []
excitmentlist = []
reviewlist = []
for library in librarynr:
    for book in books:
        for section in sections:
            indexlist.append(str(library) + book + section)
            nrofletterlist.append(nrofletters)
            # example of a variable computed from the loop iterators:
            if (library % 2 == 0) or (book[1] == 'L'):
                excitment = False
            else:
                excitment = True
            excitmentlist.append(excitment)
            reviewlist.append(review)
summarydf['index'] = indexlist
summarydf['nrofletters'] = nrofletterlist
summarydf['excitment'] = excitmentlist
summarydf['review'] = reviewlist
listtime = time.time() - start_time
print(listtime)
CodePudding user response:
Append is very slow; you should produce your DataFrame in one shot.
IIUC, you want the product of all possibilities, which you can build with itertools.product.
Here is an example with only librarynr = 5. Your setting with librarynr = 30000 would produce 480,000 rows.
import pandas as pd
from itertools import product

librarynr = 5
idx = map(''.join, product(map(str, range(librarynr)), books, sections))
df = pd.DataFrame([], index=idx)
df[['nrofletters', 'excitment', 'review']] = [3000, True, 'positive']
Output:
>>> print(df.reset_index())
index nrofletters excitment review
0 0LordOfTheRingsIntroduction 3000 True positive
1 0LordOfTheRingsMainPart 3000 True positive
2 0LordOfTheRingsPlottwist 3000 True positive
3 0LordOfTheRingsSurprisingEnd 3000 True positive
4 0HarryPotterIntroduction 3000 True positive
.. ... ... ... ...
75 4LoveStorySurprisingEnd 3000 True positive
76 4RandomBookIntroduction 3000 True positive
77 4RandomBookMainPart 3000 True positive
78 4RandomBookPlottwist 3000 True positive
79 4RandomBookSurprisingEnd 3000 True positive
[80 rows x 4 columns]
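Since the question notes that excitment depends on the loop iterators, that part can be vectorized as well. A minimal sketch, assuming the books and sections lists from the question and librarynr = 5 (the repeat/tile pattern here reproduces the nested-loop order: library varies slowest, then book, then section):

```python
import numpy as np
import pandas as pd
from itertools import product

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = 5

idx = map(''.join, product(map(str, range(librarynr)), books, sections))
df = pd.DataFrame([], index=idx)
df[['nrofletters', 'review']] = [3000, 'positive']

# Rebuild the loop iterators as arrays aligned with the product order
libs = np.repeat(np.arange(librarynr), len(books) * len(sections))
book_has_L = np.array([b[1] == 'L' for b in books])  # per-book flag
bflags = np.tile(np.repeat(book_has_L, len(sections)), librarynr)

# Vectorized form of: excitment = not (library % 2 == 0 or book[1] == 'L')
df['excitment'] = ~((libs % 2 == 0) | bflags)
```

This computes the whole boolean column in a handful of array operations instead of one Python-level branch per row.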
CodePudding user response:
CPython loops are very slow due to the interpreter (note that there are faster Python interpreters, like PyPy, which use a JIT compiler). You can use list comprehensions to speed up the loops significantly. Using itertools can help a bit more, as can converting library to a string in the outermost loop. However, the result is still not very fast for the operation performed.
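As a sketch of that suggestion: hoist the str() conversion out of the inner loops and let itertools.product drive a single comprehension (reusing the lists from the question):

```python
from itertools import product

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = 30000

# str() is called once per library instead of once per row
libs = [str(i) for i in range(librarynr)]
indexlist = [lib + book + section
             for lib, book, section in product(libs, books, sections)]
```

The comprehension avoids the repeated attribute lookup and function-call overhead of list.append inside nested loops.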
Another issue comes from the conversion of Python lists to Numpy arrays. Indeed, Pandas uses Numpy internally, and Numpy converts each string reference to a fixed-width string (so it uses one big raw memory buffer rather than an array of reference-counted objects). This means that each string is parsed and copied by Numpy, which is very expensive. The best solution is to write the Numpy array directly using vectorized functions, if possible. If this is not possible, you can use Numba; however, note that Numba barely supports string arrays so far. Another possible solution is to use Cython. A direct assignment in Pandas can also be very fast for setting all the strings at once (since each string is then only parsed once by Numpy internally).
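One way to build the string index directly in Numpy is np.char.add, which concatenates string arrays elementwise. A sketch with librarynr = 5 (note that np.char.add allocates intermediate fixed-width string arrays, so this trades memory for avoiding the per-row Python work):

```python
import numpy as np

books = ['LordOfTheRings', 'HarryPotter', 'LoveStory', 'RandomBook']
sections = ['Introduction', 'MainPart', 'Plottwist', 'SurprisingEnd']
librarynr = 5

libs = np.arange(librarynr).astype(str)  # vectorized int -> str conversion
# repeat/tile reproduce the nested-loop order: library, book, section
part = np.char.add(np.repeat(libs, len(books) * len(sections)),
                   np.tile(np.repeat(books, len(sections)), librarynr))
idx = np.char.add(part, np.tile(sections, librarynr * len(books)))
```

The resulting array can be passed straight to pd.DataFrame as the index, with no per-string Python object in between.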
On my machine, about 2/3 of the time is spent in the loop and 1/3 in the Numpy string conversion (a minor part comes from some additional Pandas overhead).