I have found that retrieving data from a DataFrame is very fast: I created a DataFrame with 1 million rows, and filtering it for the rows I needed took less than 1 second. So why is it so slow to add data to an empty DataFrame with the append method?
Here is my code, which took more than 2 hours to run. What am I missing? Is there a better way to add data than the df.append method?
import pandas as pd
import datetime
import random

data = pd.DataFrame(columns=('Open', 'High', 'Low', 'Close', 'Avg20'))
start = datetime.datetime.now()
for i in range(1000000):
    # Report progress every 10,000 rows
    if i % 10000 == 0:
        print(i / 1000000 * 100, '% completed.')
    # Each append call returns a new DataFrame holding all previous rows plus this one
    data = data.append({'Open': random.random(), 'High': random.random(), 'Low': random.random(), 'Close': random.random(), 'Avg20': random.random()}, ignore_index=True)
end = datetime.datetime.now()
print(start, end)
Thanks in advance.
CodePudding user response:
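DataFrame.append is slow because each call builds an entirely new DataFrame, copying every existing row along with the new one. Adding n rows one at a time therefore copies on the order of n^2 rows in total, which is why a million appends takes hours.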
If you just want to optimize the code above, append each row to a plain Python list instead of to the DataFrame (appending to a list is fast), then create the DataFrame once, outside the loop, by passing it the list.
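The list-based approach might look like the minimal sketch below. The column names come from the question; the `build_frame` helper and the `COLUMNS` constant are names made up for this illustration, and the row count is reduced to 100,000 to keep the example quick.

```python
import random

import pandas as pd

COLUMNS = ["Open", "High", "Low", "Close", "Avg20"]


def build_frame(n_rows):
    """Collect rows in a plain list, then build the DataFrame once.

    Hypothetical helper written for this example, not part of pandas.
    """
    rows = []
    for _ in range(n_rows):
        # list.append does not copy the rows collected so far
        rows.append({name: random.random() for name in COLUMNS})
    # A single constructor call builds the whole DataFrame in one pass
    return pd.DataFrame(rows, columns=COLUMNS)


data = build_frame(100_000)
print(data.shape)
```

Because the DataFrame is constructed only once, the total work grows roughly linearly with the number of rows instead of quadratically.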
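Similarly, if you need to combine many DataFrames, it is fastest to do it with a single call to pd.concat rather than many calls to DataFrame.append. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions pd.concat is the only one of the two that still exists.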
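As a rough sketch of the pd.concat approach, assuming the pieces to combine are already available as a list of DataFrames (the `make_chunk` helper below is hypothetical and only generates sample data for the illustration):

```python
import numpy as np
import pandas as pd

COLUMNS = ["Open", "High", "Low", "Close", "Avg20"]


def make_chunk(n_rows, rng):
    """Hypothetical helper: one small DataFrame of random values."""
    return pd.DataFrame(rng.random((n_rows, len(COLUMNS))), columns=COLUMNS)


rng = np.random.default_rng(0)
chunks = [make_chunk(1_000, rng) for _ in range(100)]

# One concat call assembles the result in a single pass instead of
# re-copying the accumulated rows for every chunk;
# ignore_index=True renumbers the rows from 0 to N-1.
combined = pd.concat(chunks, ignore_index=True)
print(combined.shape)
```

The key point is that the list of DataFrames is passed to pd.concat once, after the loop, rather than concatenating inside the loop.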