Find max value of a dataframe column containing numpy arrays

I was trying to find the maximum value of a column in a dataframe that contains numpy arrays.

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 33, 4],
                   'a': [1, 22, 23, 44],
                   'b': [1, 42, 23, 42]})
df['new'] = df.apply(lambda r: tuple(r), axis=1).apply(np.array)

This is what the dataframe looks like:

    id  a   b   new
0   1   1   1   [1, 1, 1]
1   2   22  42  [2, 22, 42]
2   33  23  23  [33, 23, 23]
3   4   44  42  [4, 44, 42]

Now I want to find the single maximum value across all arrays in column new. In this case it is 44. Is there a quick and easy way to do this?

CodePudding user response：

Because your new column is built from the columns id, a, and b, you can take the maximum over the whole dataframe before you create the new column:

single_max = np.max(df.values)

Or, if the dataframe already contains the new column, you can drop it before taking the maximum:

single_max = np.max(df.drop('new',axis=1).values)
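If the dataframe might hold other columns that should not count toward the maximum, selecting the source columns by name is one alternative to dropping new. A minimal sketch, assuming the source columns are exactly id, a, and b as in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 33, 4],
                   'a': [1, 22, 23, 44],
                   'b': [1, 42, 23, 42]})
df['new'] = df.apply(lambda r: tuple(r), axis=1).apply(np.array)

# Pick only the numeric source columns (assumed to be id, a, b),
# convert them to a 2-D NumPy array, and reduce over all elements.
single_max = df[['id', 'a', 'b']].to_numpy().max()
print(single_max)
```

This only gives the right answer while new really is a copy of those columns; if the arrays are modified afterwards, reduce over new itself instead.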

CodePudding user response：

You can apply a lambda to the values that calls the array's max method. This would result in a Series that also has a max method.

df['new'].apply(lambda arr: arr.max()).max()

Just guessing, but this should be faster than .apply(max), because the array's max method runs in optimized NumPy code, whereas the built-in max iterates over the array one element at a time in Python.
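The guess above can be checked with timeit. The following is only a rough sketch: the data (10,000 rows of 50-element integer arrays) is made up for the measurement, and the timings will vary with the machine and with the NumPy and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical test data: 10,000 rows, each holding a 50-element integer array.
s = pd.Series([rng.integers(0, 1000, size=50) for _ in range(10_000)])

# Both expressions compute the same overall maximum.
m_method = s.apply(lambda arr: arr.max()).max()
m_builtin = s.apply(max).max()

# Time each variant over a few repetitions.
t_method = timeit.timeit(lambda: s.apply(lambda arr: arr.max()).max(), number=5)
t_builtin = timeit.timeit(lambda: s.apply(max).max(), number=5)
print(f"arr.max(): {t_method:.3f}s  built-in max: {t_builtin:.3f}s")
```

With arrays as short as the three-element ones in the question, the difference is likely to be small either way.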

CodePudding user response：

A possible solution:

df.new.explode().max()

Or a faster alternative, provided all the arrays have the same length:

np.max(np.vstack(df.new.values))

Returns 44.
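np.vstack needs every array to have the same length. If the arrays in the column can differ in length, joining them end to end with np.concatenate is one way around that. A small sketch with a made-up ragged column (not the data from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical column whose arrays have different lengths.
s = pd.Series([np.array([1, 1]), np.array([2, 22, 42]), np.array([44])])

# np.vstack(s.values) would raise a ValueError for this column.
# Concatenating the 1-D arrays into one flat array does not need matching lengths.
ragged_max = np.concatenate(s.to_list()).max()
print(ragged_max)
```

explode().max() also copes with arrays of different lengths, since it never stacks them into a rectangular array.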

CodePudding user response：

Assuming you only want to consider the column "new":

import numpy as np
out = np.max(tuple(df['new'])) # or np.max(df['new'].tolist())

Output: 44