Trying to figure out a more Pythonic way to speed up code that is still slow after using a list comprehension


I have two dataframes: allprice_df, a large data set containing a time series of prices for all stocks, and init_df, containing selected stocks and their trade entry dates. For each ticker in init_df, I am trying to find the highest price after the entry date and the date on which it occurred.

The following code works, but it is time-consuming, and I am wondering whether there is a better, more Pythonic way to accomplish this.

import datetime, itertools, sys
import pandas as pd

# Initial call
init_df = init_df.assign(HighestHigh=lambda x:
    highestHigh(x['DateIdentified'], x['Ticker'], allprice_df))

# HighestHigh function in lambda call
def highestHigh(date1, ticker, allp_df):
    if date1.size == ticker.size:
        temp_df = pd.DataFrame(columns=['DateIdentified', 'Ticker'])
        temp_df['DateIdentified'] = date1
        temp_df['Ticker'] = ticker
    else:
        print("dates and tickers size mismatching")
        sys.exit(1)
    counter = itertools.count(0)
    high_list = [getHigh(x, y, allp_df, next(counter))
                 for x, y in zip(temp_df['DateIdentified'], temp_df['Ticker'])]
    return high_list

# Getting high for each ticker
def getHigh(dateidentified, ticker, allp_df, count):
    print("trade %s" % count)
    currDate = datetime.datetime.now().date()
    allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]
    hh = allpm_df.iloc[:,0].max()
    hd = allpm_df.loc[(allpm_df['high']==hh),'date']
    hh = round(hh,2)
    h_list = [hh, hd]
    return h_list

# Split the list into 2 columns: one with the price and the other with the corresponding date
init_df = split_columns(init_df,"HighestHigh")

# The function to split the list elements into different columns
def split_columns(orig_df, col):
    split_df = pd.DataFrame(orig_df[col].tolist(), columns=[col + "Mod", col + "Date"])
    split_df[col + "Date"] = split_df[col + "Date"].apply(lambda x: x.squeeze())
    orig_df = pd.concat([orig_df, split_df], axis=1)
    orig_df = orig_df.drop(col, axis=1)
    orig_df = orig_df.rename(columns={col + "Mod": col})
    return orig_df

CodePudding user response:

There are a couple of obvious solutions that would help reduce your runtime.

First, in your getHigh function, instead of using loc to get the date associated with the maximum value for high, use idxmax to get the index of the row associated with the high and then access that row:

hh, hd = allpm_df.loc[allpm_df['high'].idxmax(), ['high', 'date']]

This replaces two O(N) passes (finding the maximum of the column, then scanning the column again with a boolean comparison to find its row) with a single O(N) pass for idxmax plus an O(1) label lookup.
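Applied to the getHigh function from the question, the change might look like the following sketch (the signature and column names are taken from the question; the empty-result guard and the NaN/NaT placeholders are assumptions I have added):

def getHigh(dateidentified, ticker, allp_df, count):
    print("trade %s" % count)
    currDate = datetime.datetime.now().date()
    # Same boolean-mask filter as in the question
    allpm_df = allp_df.loc[(allp_df['Ticker'] == ticker) &
                           (allp_df['date'] > dateidentified) &
                           (allp_df['date'] <= currDate), ['high', 'date']]
    if allpm_df.empty:
        # Assumption: return placeholders when no rows match
        return [float('nan'), pd.NaT]
    # One scan to find the row label of the maximum, one label lookup for the row
    hh, hd = allpm_df.loc[allpm_df['high'].idxmax(), ['high', 'date']]
    return [round(hh, 2), hd]

Note that hd is now a scalar date rather than a one-element Series, so the .squeeze() step in split_columns would no longer be needed.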

Edit

In light of your information on the size of your dataframes, my best guess is that this line is probably where most of your time is being consumed:

allpm_df = allp_df.loc[((allp_df['Ticker']==ticker)&(allp_df['date']>dateidentified)&(allp_df['date']<=currDate)),['high','date']]

In order to make this faster, I would set up your data frame with a multi-index when you first create it:

index = pd.MultiIndex.from_arrays(arrays=[ticker_symbols, dates], names=['Symbol', 'Date'])
allp_df = pd.DataFrame(data, index=index)
allp_df = allp_df.sort_index()  # sort both levels so range slices are fast

This creates a dataframe with a sorted, multi-level index on ticker symbol and date, which will reduce your search time tremendously. Once you do that, you should be able to access all the data associated with a ticker symbol and a given date range like this:

allp_df.loc[(ticker, dateidentified):(ticker, currDate)]

which should return your data much more quickly. For more information on multi-indexing, see the pandas documentation on MultiIndex and advanced indexing.
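Combined with the idxmax change above, the per-trade lookup could shrink to something like this sketch (the 'high' column and the 'Symbol'/'Date' index levels are assumed from the snippets above; note that the label slice is inclusive of both endpoints, unlike the strict > comparison in the original code):

def getHigh(dateidentified, ticker, allp_df):
    currDate = datetime.datetime.now().date()
    # One sorted-index slice instead of three boolean masks over the whole frame
    window = allp_df.loc[(ticker, dateidentified):(ticker, currDate)]
    if window.empty:
        # Assumption: return placeholders when no rows fall in the range
        return [float('nan'), pd.NaT]
    # idxmax on 'high' returns the (Symbol, Date) label of the maximum row
    max_label = window['high'].idxmax()
    hh = round(window.loc[max_label, 'high'], 2)
    hd = max_label[1]  # the Date level of that label
    return [hh, hd]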
