Currently I have a dataframe that I rank the values of each column and output them into a new dataframe. Example code below:
df = pd.DataFrame(np.random.randint(0, 500, size=(500, 1000)), columns=list(range(0, 1000)))
ranking = pd.DataFrame(range(0, 500), columns=['Lineup'])
ranking = pd.concat([ranking, df[range(0, 1000)].rank(ascending=False, method='min')],
axis=1)
df is the dataframe of the values, each column header is an integer, increasing by 1 for each successive column. ranking is first created with a single column as a identifier by "Lineup" then the dataframe "df" is concatenated and ranked at the same time.
Now the question is, is this the fastest way to do this? When there are tens of thousands of columns and hundreds of rows this can take far longer than I hoped it would. Is there a way to use list comprehension to speed this up or some sort of other method that outputs a list, dictionary, dataframe or anything else I can use for my future steps.
Thanks
CodePudding user response:
You can use the Numba JIT to compute this more efficiently and in parallel. The idea is to compute the rank of each column in parallel. Here is the resulting code:
# Equivalent of df.rank(ascending=False, method='min')
@nb.njit('int32[:,:](int32[:,:])', parallel=True)
def fastRanks(df):
n, m = df.shape
res = np.empty((n, m), dtype=np.int32)
for col in nb.prange(m):
dfCol = -df[:, col]
order = np.argsort(dfCol)
# Compute the ranks with the min method
if n > 0:
prevVal = dfCol[order[0]]
prevRank = 1
res[order[0], col] = 1
for row in range(1, n):
curVal = dfCol[order[row]]
if curVal == prevVal:
res[order[row], col] = prevRank
else:
res[order[row], col] = row 1
prevVal = curVal
prevRank = row 1
return res
df = pd.DataFrame(np.random.randint(0, 500, size=(500, 1000)), columns=list(range(0, 1000)))
ranking = pd.DataFrame(range(0, 500), columns=['Lineup'])
ranking = pd.concat([ranking, pd.DataFrame(fastRanks(df[range(0, 1000)].to_numpy()))], axis=1)
On my 6-core machine, the computation of the ranks is about 7 times faster. The overall computation is bounded by the slow pd.concat
.
You can further improve the speed of the overall computation by building directly the output of fastRanks
with the "Lineup" column. The names of the dataframe columns have to be set manually from the Numpy array produced by the Numba function. Note that this optimization require all the columns to be of the same type, which is the case in your example.
Note that the ranks are of types int32
in this solution for sake of performance (since float64
are not needed here).