Efficiently find the minimum value in a DataFrame

Time:10-22

I can get the minimum value, along with its row and column, on a small dataset using the usual Pandas vectorized operations:

import pandas as pd

data = pd.DataFrame([
  ['A', 'asdf', 1, 2, 3],
  ['B', 'zxcv', 4, 5, 6],
  ['C', 'qwer', 2, 5, 3],
  ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009']).set_index('entity')

data_num = data.drop(columns=['code'])

minval = data_num.min().min()
minval_row = data_num.min(axis='columns').idxmin()
minval_col = data_num.min().idxmin()

print(minval, minval_row, minval_col)
# 1 A 2007

However, this requires 3 separate scans over the data, which is inefficient.

It's algorithmically possible to get all 3 of these values in a single pass over the data. Is this possible using the Pandas or Numpy APIs?

Note that I also want to avoid copying the data. I think if you have data of all the same dtype, .to_numpy() won't make a copy, but I'm not 100% sure.
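One way to check the copy question empirically is `np.shares_memory`, which reports whether two arrays overlap in memory. This is a quick sketch (not from the original post); note that pandas documents `to_numpy(copy=False)` as *not guaranteeing* a view, so the result can vary by version and block layout:

```python
import numpy as np
import pandas as pd

# Small homogeneous-dtype frame: all columns are integers.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['2007', '2008', '2009'])

arr = df.to_numpy()
# For a single homogeneous block this is often a view rather than a copy,
# but pandas does not guarantee it; pass copy=True if you need a real copy.
print(np.shares_memory(arr, df.values))
```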

CodePudding user response:

I think you can stack:

s = data_num.stack()
minval = s.idxmin()

print([s[minval], *minval])

[1, 'A', '2007']

CodePudding user response:

If performance is an issue, numpy is much faster (~10× on 1M values).

You can get the position of the minimum in the flattened underlying array using numpy.argmin (a single pass), then back-calculate the row and column from that flat index with simple integer arithmetic (a fast operation):

import numpy as np

idx = np.argmin(data_num.values)                    # flat index of the minimum (single pass)
col = data_num.columns[idx % len(data_num.columns)]  # column = flat index mod row width
row = data_num.index[idx // len(data_num.columns)]   # row = flat index div row width
col, row

output:

('2007', 'A')
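As a variation (not part of the original answer), `np.unravel_index` performs the same flat-to-2D index conversion and arguably reads more clearly. A self-contained sketch using the question's data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([
    ['A', 'asdf', 1, 2, 3],
    ['B', 'zxcv', 4, 5, 6],
    ['C', 'qwer', 2, 5, 3],
    ['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009']).set_index('entity')
data_num = data.drop(columns=['code'])

arr = data_num.to_numpy()                            # usually no copy for a homogeneous dtype
i, j = np.unravel_index(np.argmin(arr), arr.shape)   # single pass, then flat -> (row, col)
print(arr[i, j], data_num.index[i], data_num.columns[j])
# 1 A 2007
```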

Timing on a 1M values dataframe (1000×1000):

# numpy solution above
943 µs ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# pandas df.stack().idxmin()
10.7 ms ± 844 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
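To reproduce a comparison like this outside IPython's `%timeit`, the standard-library `timeit` module works. This is a sketch under my own setup (random 1000×1000 frame), not the original benchmark:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 1000)))

def numpy_min(df):
    # Single pass over the flattened array, then integer math for row/col.
    idx = np.argmin(df.to_numpy())
    n = len(df.columns)
    return df.index[idx // n], df.columns[idx % n]

def pandas_min(df):
    # stack() builds a MultiIndexed Series; idxmin returns a (row, col) tuple.
    return df.stack().idxmin()

print('numpy :', timeit.timeit(lambda: numpy_min(df), number=100))
print('pandas:', timeit.timeit(lambda: pandas_min(df), number=10))
```

Both functions return the same `(row, col)` pair, so they can be cross-checked before timing.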