I can get the minimum value, along with its row and column, on a small dataset using the usual Pandas vectorized operations:
import pandas as pd
data = pd.DataFrame([
['A', 'asdf', 1, 2, 3],
['B', 'zxcv', 4, 5, 6],
['C', 'qwer', 2, 5, 3],
['D', 'hjkl', 3, 9, 4],
], columns=['entity', 'code', '2007', '2008', '2009']).set_index('entity')
data_num = data.drop(columns=['code'])
minval = data_num.min().min()
minval_row = data_num.min(axis='columns').idxmin()
minval_col = data_num.min().idxmin()
print(minval, minval_row, minval_col)
# 1 A 2007
However, this requires 3 separate scans over the data, which is inefficient.
It's algorithmically possible to get all 3 of these values in a single pass over the data. Is this possible using the Pandas or Numpy APIs?
Note that I also want to avoid copying the data. I believe that if all the columns share the same dtype, .to_numpy()
won't make a copy, but I'm not 100% sure.
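For what it's worth, whether a conversion copied the buffer can be checked with np.shares_memory. A small sketch (using a toy frame, not the data above) that inspects what .to_numpy() returns:

```python
import numpy as np
import pandas as pd

# Homogeneous frame: all columns are int64, stored in one underlying block.
homog = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
arr = homog.to_numpy()

# np.shares_memory reports whether two arrays alias the same buffer;
# if to_numpy() returned a view, repeated calls alias the same block.
shares = np.shares_memory(arr, homog.to_numpy())
print(arr.shape, shares)
```

Whether a view or a copy is returned depends on the pandas version and settings (e.g. copy-on-write), so checking at runtime like this is safer than assuming.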
CodePudding user response:
I think you can use stack:
s = data_num.stack()   # Series indexed by a (row, column) MultiIndex
minval = s.idxmin()    # single pass: index tuple of the minimum
print([s[minval], *minval])
[1, 'A', '2007']
CodePudding user response:
If performance is an issue, numpy is much faster (~10× on 1M values).
You can get the min from the flattened underlying array using numpy.argmin
(a single pass), then back-calculate the row and column positions with simple integer arithmetic (a fast operation):
import numpy as np
idx = np.argmin(data_num.to_numpy())            # flat index of the minimum
col = data_num.columns[idx % len(data_num.columns)]
row = data_num.index[idx // len(data_num.columns)]
col, row
output:
('2007', 'A')
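A side note: the modulo/division arithmetic can also be delegated to np.unravel_index, which converts a flat argmin index into (row, col) coordinates in one call. A sketch with the same frame as in the question:

```python
import numpy as np
import pandas as pd

data_num = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [2, 5, 3], [3, 9, 4]],
    index=['A', 'B', 'C', 'D'],
    columns=['2007', '2008', '2009'],
)

arr = data_num.to_numpy()
# np.argmin scans the flattened array once; np.unravel_index maps the
# flat position back to (row, col) coordinates for the given shape.
i, j = np.unravel_index(np.argmin(arr), arr.shape)
print(arr[i, j], data_num.index[i], data_num.columns[j])
# 1 A 2007
```

This is equivalent to the manual `%` / `//` arithmetic above, just a bit harder to get wrong for arrays with more than two dimensions.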
Timing on a 1M values dataframe (1000×1000):
# numpy solution above
943 µs ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# pandas df.stack().idxmin()
10.7 ms ± 844 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)