Code much slower in script than when executed independently in console


The Issue:

Alright, so I've got a script that takes forever to run. Debugging and profiling in PyCharm have helped me narrow the bottleneck down to one bit of code. It's part of a much larger script, of course, so I've distilled it into the snippet below for testing.

t1 = time()
index_mask = lambda quarter: (df['Timestamp'] >= quarter['start']) & (df['Timestamp'] <= quarter['end'])
masks = [index_mask(quarter) for quarter in quarters]
print(time() - t1)

Progress so far:

Currently, my method for comparison is to set a breakpoint in PyCharm just before the troublesome code and run the snippet above in the debug console, then do the same in a separate IPython console. The former takes 40-60 seconds to run, while the IPython console finishes in under a second. Note that the script also takes a very long time if I just run it from the command prompt with >python myscript.py, so I don't think it's necessarily a PyCharm issue.

(I know the time module isn't as accurate as timeit and friends, but I just need rough estimates.)
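
For slightly sturdier numbers, timeit can repeat the snippet a few times; a minimal sketch, assuming df and quarters are already defined as in the setup code below:

import timeit

# Best of three runs of the mask construction; assumes df and quarters
# are already defined (see the setup code below).
best = min(timeit.repeat(
    lambda: [(df['Timestamp'] >= q['start']) & (df['Timestamp'] <= q['end'])
             for q in quarters],
    repeat=3, number=1))
print(f'best of 3: {best:.3f} s')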

I'm at an absolute loss as to why there is such a difference in run time. Any advice would be greatly appreciated.

Setup Code:

This code sets up the variables needed to run the snippet above.

import pandas as pd
from time import time
import datetime as dt
from random import random

def quarter_bounds(start_date,end_date):

    # calculate the number of quarters within the input bounding dates:
    number_of_quarters = round((end_date - start_date).days / 365.25 * 4)

    # functions to calculate a quarter's first and last hours:
    quarter_start = lambda x: dt.datetime(start_date.year + int(x / 4), (x % 4) * 3 + 1, 1, 0, 0)
    quarter_end = lambda x: dt.datetime(start_date.year + int((x + 1) / 4), ((x + 1) % 4) * 3 + 1, 1, 0, 0) + dt.timedelta(hours=-1)

    # return list of dictionaries with quarter index, quarter number (0-3), and bounding hours:
    return [{
        'index' : x,
        'quarter' : x%4,
        'start' : quarter_start(x),
        'end' : quarter_end(x)
    } for x in range(number_of_quarters)]


df = pd.DataFrame(columns=['PA','CZ','DT','Timestamp','Units','Value'], data=[['xxx','xxx','xx', pd.Timestamp(year=2020, month=1, day=1, hour=0, minute=0, second=0) + pd.DateOffset(hours=i), 'MWH', random()*10] for i in range(0, 271560)])

start_date = df.loc[:,'Timestamp'].min()
end_date = df.loc[:,'Timestamp'].max()

quarters = quarter_bounds(start_date,end_date)
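
For reference, a quick sanity check of what quarter_bounds returns; with this setup, the first element of quarters should look like:

print(quarters[0])
# {'index': 0, 'quarter': 0,
#  'start': datetime.datetime(2020, 1, 1, 0, 0),
#  'end': datetime.datetime(2020, 3, 31, 23, 0)}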

CodePudding user response:

OK, I've come as close to a solution as I can. Maybe other folks can shed more light on why this works, but here's what I found.

For whatever reason, when the code in myscript.py reads the data in from some CSVs, the resulting DataFrames get borked. Not sure why or how, but if I "reset" some of the data, I get the fast performance I was looking for.

import numpy as np

df['Value'] = df['Value'].apply(np.float64)  # already a float
df['Timestamp'] = df['Timestamp'].apply(str).apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))  # already a pandas Timestamp
df.reset_index(inplace=True)
df = df.set_index(['PA', 'CZ', 'DT']).sort_index()

As you can see, this code doesn't really change the data, not permanently at least, but it makes a huge difference. I tried some experiments like resetting only the index and leaving Value and Timestamp alone; these yielded some improvement, but not as much as doing all three. Not sure if I just hit the perfect storm of inefficiency, but I hope this helps others, or myself, in the future.
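
My best guess, though I haven't verified it, is that the CSV read left the Timestamp column with object dtype, which forces pandas to compare values element by element in Python rather than as vectorized datetime64 operations. A quick way to check, and a more direct conversion than the str/strptime round trip above:

# If this prints 'object' for Timestamp, that would explain the slow masks:
# comparisons on object columns run element by element in Python, while
# datetime64 columns compare as fast vectorized operations.
print(df.dtypes)

# Convert in place instead of round-tripping through strings:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Value'] = pd.to_numeric(df['Value'])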
