Python lambda on a pandas dataframe containing 2d Numpy arrays


I need to convert several NumPy arrays according to this rule: compare the arrays elementwise, and if the value of a given array at some position is greater than 0.5 and is the greatest among all the arrays at that position, set that position of the corresponding output array to one; otherwise set it to zero.

import pandas as pd
import numpy as np

def max_is_greater_than_half_1d(*args):
    df = pd.DataFrame({'col_' + str(i + 1): val for i, val in enumerate(args)})
    max_val = df.apply(max, axis=1)
    df = df.apply(lambda x: (x > 0.5) & (max_val == x), axis=0).astype(int)
    return [np.array(df[col].values) for col in df.columns]

in_1=np.array([0.4, 0.7, 0.8, 0.3, 0.3])
in_2=np.array([0.9, 0.8, 0.6, 0.4, 0.4])
in_3=np.array([0.5, 0.5, 0.5, 0.2, 0.6])

out_1, out_2, out_3 = max_is_greater_than_half_1d(in_1, in_2, in_3)
# out_1: [0, 0, 1, 0, 0]
# out_2: [1, 1, 0, 0, 0]
# out_3: [0, 0, 0, 0, 1]

This works. But how can I do the same operation on several 2D arrays?

example:

in_1=np.array([[0.4, 0.7], [0.8, 0.3]])
in_2=np.array([[0.9, 0.8], [0.6, 0.4]])

out_1 = [[0, 0], [1, 0]] and out_2 = [[1, 1], [0, 0]]

In my case I have six 2000x2000 arrays, so looping over elements in Python is going to be too slow. An operation on a whole array is preferable.

CodePudding user response:

It's almost the same code,

def max_is_greater_than_half_2d(*args):
    shape = args[0].shape
    df = pd.DataFrame({'col_' + str(i + 1): val.flatten() for i, val in enumerate(args)})
    max_val = df.apply(max, axis=1)
    df = df.apply(lambda x: (x > 0.5) & (max_val == x), axis=0).astype(int)
    return [df[col].values.reshape(shape) for col in df.columns]

in_1=np.array([[0.4, 0.7], 
               [0.8, 0.3]])
in_2=np.array([[0.9, 0.8], 
               [0.6, 0.4]])
max_is_greater_than_half_2d(in_1, in_2)
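For the 2x2 example this returns the masks given in the question:

# [array([[0, 0],
#         [1, 0]]),
#  array([[1, 1],
#         [0, 0]])]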

or you could reuse the 1D function,

def max_is_greater_than_half_2d(*args):
    shape = args[0].shape
    flat = [val.flatten() for val in args]
    out  = max_is_greater_than_half_1d(*flat)
    return [val.reshape(shape) for val in out]
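Called the same way, max_is_greater_than_half_2d(in_1, in_2) produces the same two masks as the first version.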

CodePudding user response:

Here is a faster way to do this comparison:

def max_is_greater_than_half_2d_numpy(*args):
    max_arr = np.maximum.reduce(args)  # elementwise maximum across all input arrays
    res = []
    for ar in args:
        # zero where the value is not the maximum or not above 0.5, one elsewhere
        res.append(np.where(np.logical_or(ar < max_arr, ar < 0.5), 0.0, 1.0))
    return res
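As a quick check (not part of the original timing runs), applying it to the 2x2 arrays from the question reproduces the expected masks, as floats:

out_1, out_2 = max_is_greater_than_half_2d_numpy(in_1, in_2)
# out_1: [[0., 0.], [1., 0.]]
# out_2: [[1., 1.], [0., 0.]]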

# 0.026 s to create two arrays below
rand_1=np.random.default_rng().random((2000, 2000),dtype=np.float32)
rand_2=np.random.default_rng().random((2000, 2000),dtype=np.float32)

# 0.05 s with np.maximum.reduce, np.where
r_out_v1 = max_is_greater_than_half_2d_numpy(rand_1, rand_2)  

# 45.0 s with pandas dataframe
r_out_v2 = max_is_greater_than_half_2d(rand_1, rand_2)

NumPy's np.maximum.reduce followed by np.where is about 1000 times faster than the pandas DataFrame approach, so I will accept my own answer.
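An equivalent fully vectorized formulation (a sketch of my own, not one of the timed runs above; the name max_is_greater_than_half_stacked is made up) stacks the inputs into one 3-D array, so the winner-and-threshold test becomes a single boolean expression with no Python-level loop over the arrays:

def max_is_greater_than_half_stacked(*args):
    stacked = np.stack(args)                       # shape (n_arrays, 2000, 2000)
    max_arr = stacked.max(axis=0)                  # elementwise maximum across the arrays
    mask = (stacked == max_arr) & (stacked > 0.5)  # greatest at this position and above 0.5
    return list(mask.astype(np.float32))

For six arrays the loop overhead is negligible either way, so this mainly trades one stacked copy in memory for more compact code.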
