How can I optimize this pandas dataframe code-CodePudding

I have a large dataframe (resultsDF) set up with a sort-of three column index: [Ticker, Fiscal Year, Fiscal Period] like so

      Ticker  Fiscal Year Fiscal Period  Market Returns       ROI
21        AA         2017            Q3        0.028904  0.006549
23        AA         2019            Q2        0.038022 -0.026191
24        AA         2018            Q3        0.024746 -0.002533
25        AA         2018            Q4       -0.058857  0.004141
26        AA         2019            Q3        0.010057 -0.015065
...      ...          ...           ...             ...       ...
36582    ZTS         2017            Q1        0.056305  0.031115
36583    ZTS         2019            Q1        0.014543  0.028669
36584    ZTS         2018            Q4       -0.058857  0.032013
36585    ZTS         2019            Q4        0.033800  0.033261
36586    ZTS         2020            Q1       -0.041786  0.036693

Part one: I'm creating a column Beat Market (Overall) to track if the average ROI for the ticker is greater than the average return of the market. Since it tracks the averages, that new column will be either one or zero for the whole stock. So every value in Beat Market (Overall) for AA might be 1, where as every value in Beat Market (Overall) for ZTS might be 0. The complicated thing about it is that those ROI's would need to be summed for each ticker, possibly using the groupby function, but I'm not sure how to use that in this context.

Part two: I have a lookup dataframe, companies, that holds industry id information for each ticker. I want to match up the Sector column in the final dataframe to the correct industry id for each ticker without using a loop.

The problem is that running this code for this dataset takes a way too long, and if I could vectorize it, it would be much faster.

Here's the loop I'd like to vectorize:

# find if AVG returns for each ticker is greater than AVG returns for market
avg_ror = resultsDF["Market Returns"].unique().mean()

for index, row in resultsDF.iterrows():
    roi = resultsDF[resultsDF["Ticker"] == row["Ticker"]]["ROI"].mean()
    resultsDF.loc[index, "Beat Market (Overall)"] = 1 if roi > avg_ror else 0
    try:
        resultsDF.loc[index, "Sector"] = companies.loc[row["Ticker"]][
            "IndustryId"
        ]  # ["Sector"]
    except:
        pass

The final dataframe (less some other columns) would look like this:

      Ticker  Fiscal Year Fiscal Period    Sector  Market Returns       ROI  Beat Market (Overall)
21        AA         2017            Q3    110004        0.028904  0.006549                    0.0
23        AA         2019            Q2    110004        0.038022 -0.026191                    0.0
24        AA         2018            Q3    110004        0.024746 -0.002533                    0.0
25        AA         2018            Q4    110004       -0.058857  0.004141                    0.0
26        AA         2019            Q3    110004        0.010057 -0.015065                    0.0
...      ...          ...           ...       ...             ...       ...                    ...
36582    ZTS         2017            Q1  106005.0        0.056305  0.031115                    1.0
36583    ZTS         2019            Q1  106005.0        0.014543  0.028669                    1.0
36584    ZTS         2018            Q4  106005.0       -0.058857  0.032013                    1.0
36585    ZTS         2019            Q4  106005.0        0.033800  0.033261                    1.0
36586    ZTS         2020            Q1  106005.0       -0.041786  0.036693                    1.0

CodePudding user response：

(1) You can groupby on Ticker find group mean, transform it and compare it to market returns mean for every item in the dataframe

(2) You can use .map method to map to Sector in companies dataframe (here I assumed AA is in sector A and ZTS is in B)

resultsDF['Beat Market (Overall)'] = (resultsDF.groupby('Ticker')['ROI'].transform(np.mean) > resultsDF['Market Returns'].unique().mean()).astype(int)
resultsDF['Sector'] = resultsDF['Ticker'].map(companies.set_index('Ticker')['IndustryId'])

Output:

      Ticker  Fiscal Year Fiscal Period  Market Returns    ROI  Beat Market Sector
21        AA         2017            Q3        0.028904  0.006549     0      A
23        AA         2019            Q2        0.038022 -0.026191     0      A
24        AA         2018            Q3        0.024746 -0.002533     0      A
25        AA         2018            Q4       -0.058857  0.004141     0      A
26        AA         2019            Q3        0.010057 -0.015065     0      A
36582    ZTS         2017            Q1        0.056305  0.031115     1      B
36583    ZTS         2019            Q1        0.014543  0.028669     1      B
36584    ZTS         2018            Q4       -0.058857  0.032013     1      B
36585    ZTS         2019            Q4        0.033800  0.033261     1      B
36586    ZTS         2020            Q1       -0.041786  0.036693     1      B