More efficient way to rank columns in a dataframe


Currently I have a dataframe in which I rank the values of each column and output the ranks into a new dataframe. Example code below:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 500, size=(500, 1000)), columns=list(range(0, 1000)))

ranking = pd.DataFrame(range(0, 500), columns=['Lineup'])

ranking = pd.concat([ranking, df[range(0, 1000)].rank(ascending=False, method='min')],
                        axis=1)

df is the dataframe of values; each column header is an integer, increasing by 1 for each successive column. ranking is first created with a single identifier column, "Lineup", then the dataframe df is concatenated and ranked in the same step.
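For reference, `method='min'` assigns every member of a tie group the lowest rank in that group; a minimal standalone illustration on a single Series:

```python
import pandas as pd

s = pd.Series([10, 30, 30, 20])

# With ascending=False, larger values get smaller ranks.
# Both 30s tie for first place, so each gets the minimum rank, 1;
# 20 then ranks 3rd (two values above it) and 10 ranks 4th.
print(s.rank(ascending=False, method='min').tolist())  # → [4.0, 1.0, 1.0, 3.0]
```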

Now the question is: is this the fastest way to do this? When there are tens of thousands of columns and hundreds of rows, this can take far longer than I hoped. Is there a way to use a list comprehension to speed this up, or some other method that outputs a list, dictionary, dataframe, or anything else I can use for my future steps?

Thanks

CodePudding user response:

You can use Numba's JIT compiler to compute this more efficiently and in parallel. The idea is to compute the rank of each column in parallel. Here is the resulting code:

import numba as nb
import numpy as np

# Equivalent of df.rank(ascending=False, method='min')
@nb.njit('int32[:,:](int32[:,:])', parallel=True)
def fastRanks(df):
    n, m = df.shape
    res = np.empty((n, m), dtype=np.int32)

    for col in nb.prange(m):
        # Negate so that an ascending argsort yields a descending order
        dfCol = -df[:, col]
        order = np.argsort(dfCol)

        # Compute the ranks with the min method
        if n > 0:
            prevVal = dfCol[order[0]]
            prevRank = 1
            res[order[0], col] = 1

            for row in range(1, n):
                curVal = dfCol[order[row]]
                if curVal == prevVal:
                    res[order[row], col] = prevRank
                else:
                    res[order[row], col] = row + 1
                    prevVal = curVal
                    prevRank = row + 1

    return res

df = pd.DataFrame(np.random.randint(0, 500, size=(500, 1000)), columns=list(range(0, 1000)))
ranking = pd.DataFrame(range(0, 500), columns=['Lineup'])
ranking = pd.concat([ranking, pd.DataFrame(fastRanks(df[range(0, 1000)].to_numpy(dtype=np.int32)))], axis=1)

On my 6-core machine, the computation of the ranks is about 7 times faster. The overall computation is now bounded by the slow pd.concat.

You can further improve the speed of the overall computation by building the output of fastRanks directly with the "Lineup" column. The names of the dataframe columns then have to be set manually from the NumPy array produced by the Numba function. Note that this optimization requires all the columns to be of the same type, which is the case in your example.
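A minimal sketch of that optimization, assuming the ranks come back as an int32 NumPy array (build_ranking is a hypothetical helper name, not part of the code above):

```python
import numpy as np
import pandas as pd

def build_ranking(ranks: np.ndarray) -> pd.DataFrame:
    # Hypothetical helper: prepend the 0..n-1 "Lineup" identifiers as an
    # extra int32 column, then build the DataFrame from one contiguous
    # array in a single step instead of going through pd.concat.
    n, m = ranks.shape
    out = np.empty((n, m + 1), dtype=np.int32)
    out[:, 0] = np.arange(n, dtype=np.int32)
    out[:, 1:] = ranks  # e.g. the array returned by fastRanks
    # Column names set manually to match the concat-based version
    return pd.DataFrame(out, columns=['Lineup'] + list(range(m)))
```

This avoids the copy that pd.concat performs when stitching the identifier column onto the rank columns, at the cost of requiring a single shared dtype.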

Note that the ranks are of type int32 in this solution for the sake of performance (float64 is not needed here).
