Caclulate decile ranking in pandas for a large dataset-CodePudding

See the following datasets:

Date	ticker	overnight_return
2017-07-20	CLXT	0.019556
2017-07-21	CLXT	0.039778
2022-02-14	ETNB	-0.006186
2022-02-15	ETNB	0.024590

I am currently testing a hypothesis based on the overnight return factor. I would like to first apply ranking for all the unique values in tickercolumn on every Date. And then z-scoreing for the ranks. Lastly, I would like to rank them in terms of deciles.

I used the following code to get the final z-score for one date:

import scipy.stats as stats
stats.zscore(equity_daily[equity_daily.Date == "2017-07-20"].overnight_return.rank().dropna().values)

Now I would like to get the z-score for every day based on the rank of all tickers on that particular day.

My apporach was to get the pivoted table then create a new table containing the z-scores.

equity_daily.pivot(columns = "ticker", values = "overnight_return", index = "Date")

But the following error occured:

ValueError: Index contains duplicate entries, cannot reshape

Desired outcome:

Date	ticker	overnight_return	Decile_rank
2017-07-20	CLXT	0.019556	0
2017-07-21	CLXT	0.039778	2
2022-02-14	ETNB	-0.006186	9
2022-02-15	ETNB	0.024590	8

CodePudding user response：

Not having more of a data sample, hard to test myself, but...

Try pivot_table() instead of pivot(). pivot doesn't do aggregations

CodePudding user response：

I have set up multiple helper columns.

def z_score(x):
    """Helper function for Normalization"""
    return stats.zscore(x)
def decile_rank(x):
    """Helper function for decile ranking"""
    return pd.qcut(x, 10, labels = False)

equity_daily["overnight_rank"] = equity_daily.groupby("Date")["overnight_return"].rank()
equity_daily["overnight_normalized"] = equity_daily.groupby("Date")["overnight_rank"].apply(z_score)
equity_daily["decile_rank"] = equity_daily.gruopby("Date")["overnight_normalizaed"].apply(decile_rank)

Howeve, the folloiwng error occured:

ValueError: Input array must be 1 dimensional