How to find the maximum value of 3 columns and multiply the other 2 with a fixed factor, then add al-CodePudding

I have a large dataframe df:

df = pd.DataFrame(
    {
        "length": [1200,600, 600,1201, 600, 600, 500, 700],
        "width": [600, 1200, 600, 600, 1201, 600, 800, 700],
        "height": [600, 600, 1200, 600, 600, 1201, 800, 700],
    }
)

length	width	height
1200	600	600
600	1200	600
600	600	1200
1201	600	600
600	1201	600
600	600	1201
500	800	800
700	700	700

I need to find the highest value per row, then multiply the other with *2 and add all together.

So the formula is:

highest_value   (lower_value_1 * 2)   (lower_value_2 * 2)

E.g. for row 1: 1200 (600 * 2) (600 * 2) = 3600
E.g. for row 7: 800 (800 * 2) (500 * 2) = 3400
E.g. for row 8: 700 (700 * 2) (700 * 2) = 3500

The output should be:

length	width	height	girth
1200	600	600	3600
600	1200	600	3600
600	600	1200	3600
1201	600	600	3601
600	1201	600	3601
600	600	1201	3601
500	500	800	3400
700	700	700	3500

I can find the maximum for each row with

df[["length ", "width ", "height "]].max(axis=1)

but I need help with the rest.

CodePudding user response：

This seems to be a simpler solution

df['New'] = df.sum(axis=1)*2 - df.max(axis=1)

Edit, was meant to copy this

df['Girth'] = df.apply(lambda x: sum(x)*2 - max(x), axis = 1)

CodePudding user response：

Here is a very fast, vectorized solution that takes advantage of numpy broadcasting to achieve very high performance:

mask = pd.DataFrame(df.to_numpy() != df.max(axis=1).to_numpy().reshape(-1,1), columns=df.columns)
df['girth'] = df.add(df[mask].fillna(0)).sum(axis=1)

Output:

>>> df
   length  width  height   girth
0    1200    600     600  3600.0
1     600   1200     600  3600.0
2     600    600    1200  3600.0
3    1201    600     600  3601.0
4     600   1201     600  3601.0
5     600    600    1201  3601.0

On my 3.6 GHz 8-Core Intel Core i9, this took about ~950ms for 6,000,000 (6 million) rows.

Basically what we're doing here is:

Creating a mask that selects all the values in the dataframe that are not the max of the row they're in
Adding the values selected by that mask (the non-max values) to the dataframe again
Summing the resulting rows