Home > OS >  Filtering pandas dataframe using output from idxmax function
Filtering pandas dataframe using output from idxmax function

Time:06-22

I have a pandas.DataFrame with more than one numerical columns and would like to find a maximum across row so I did below,

df = pd.DataFrame(np.random.random(size=(100000, 10)))
max_series = df.max(axis=1)

# O/P is a pd.Series like below

0     0.741459
1     0.995978
2     0.978618
3     0.973057
4     0.838006
        ...   

Next we want to find the index of maximum value. So I did below,

filter_ = df.idxmax(axis=1)

# O/P
0     3
1     8
2     7
3     5
4     1
 ..

Now using the filter_ on DataFrame I want to achieve the result same as max_series variable and without using the pd.DataFrame.max(axis=1)

So I tried below,

df.loc[:, filter_]

or

df.filter(items=filter_, axis=1)

but both give me

MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64

I don't need a 100000x100000 matrix, I just need my max_series which is 100000x1

So how do I filter the DataFrame using the filter_ and get the pd.Series of maximum across rows?

CodePudding user response:

This could be a faster solution:

%%time
df = pd.DataFrame(np.random.random(size=(100000, 10)))
max_series = df.max(axis=1)

filter_  = df.idxmax(axis=1)

unique_cols = filter_.unique()
max_series_ = pd.concat([df.loc[df.index.isin(filter_[filter_ == col].index), col] for col in unique_cols]).sort_index()
    

from pandas.testing import assert_series_equal

assert_series_equal(max_series_, max_series)

Maybe it can be even optimized further.

CodePudding user response:

This could be one of the solutions,

filter_ = df.idxmax(axis=1)
df.apply(lambda row: row[filter_.loc[row.name]], axis=1)
  • Related