I have a pandas.DataFrame with more than one numerical column and would like to find the maximum across each row, so I did the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(100000, 10)))
max_series = df.max(axis=1)
# output is a pd.Series like this:
0 0.741459
1 0.995978
2 0.978618
3 0.973057
4 0.838006
...
Next, I want to find the column label of the maximum value in each row, so I did:
filter_ = df.idxmax(axis=1)
# output:
0 3
1 8
2 7
3 5
4 1
..
Now, using filter_ on the DataFrame, I want to get the same result as max_series, but without calling pd.DataFrame.max(axis=1). So I tried:
df.loc[:, filter_]
or
df.filter(items=filter_, axis=1)
but both raise
MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
because each of the 100,000 labels in filter_ selects an entire column, producing a 100000x100000 result.
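To see why the shape explodes, here is a small sketch (assuming a hypothetical 5x3 frame instead of the full one):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.random.random(size=(5, 3)))
filter_small = small.idxmax(axis=1)        # one column label per row -> 5 labels

# Label-based column selection repeats columns, one per label,
# so 5 rows x 5 selected columns = a 5x5 frame, not a 5x1 Series.
print(small.loc[:, filter_small].shape)    # (5, 5)
```

At 100,000 rows, the same cross product is 100000x100000, hence the MemoryError.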
I don't need a 100000x100000 matrix; I just need my max_series, which is 100000x1. So how do I filter the DataFrame using filter_ and get the pd.Series of maxima across rows?
CodePudding user response:
This could be a faster solution:
%%time
df = pd.DataFrame(np.random.random(size=(100000, 10)))
max_series = df.max(axis=1)
filter_ = df.idxmax(axis=1)
unique_cols = filter_.unique()
max_series_ = pd.concat(
    [df.loc[df.index.isin(filter_[filter_ == col].index), col] for col in unique_cols]
).sort_index()
from pandas.testing import assert_series_equal
assert_series_equal(max_series_, max_series)
Maybe it can be optimized even further.
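One way to avoid the per-column concat entirely is plain NumPy fancy indexing: convert each row's winning column label to a positional index, then pick exactly one value per row. A minimal sketch, assuming the same df and filter_ as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(100000, 10)))
filter_ = df.idxmax(axis=1)

# Translate column labels to positional indices (0..9 here)
col_pos = df.columns.get_indexer(filter_)

# Pair row i with its own winning column: one value per row, shape (100000,)
max_series_ = pd.Series(df.to_numpy()[np.arange(len(df)), col_pos], index=df.index)
```

This touches each cell once with no Python-level loop, so it stays fast and allocates only an N-length array.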
CodePudding user response:
This could be one of the solutions:
filter_ = df.idxmax(axis=1)
df.apply(lambda row: row[filter_.loc[row.name]], axis=1)
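Note that apply with axis=1 calls a Python lambda once per row, which gets slow at 100,000 rows. A vectorized sketch of the same row-wise pick, using np.take_along_axis (assuming the df and filter_ from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(100000, 10)))
filter_ = df.idxmax(axis=1)

# take_along_axis wants one positional index per row, shaped (n_rows, 1)
pos = df.columns.get_indexer(filter_)[:, None]

# Gather one value per row along axis 1, then flatten back to a Series
result = pd.Series(np.take_along_axis(df.to_numpy(), pos, axis=1).ravel(),
                   index=df.index)
```

The result matches the apply version but runs entirely in NumPy.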