How to find the maximum value of 3 columns, multiply the other 2 by a fixed factor, then add all together


I have a large dataframe df:

import pandas as pd

df = pd.DataFrame(
    {
        "length": [1200, 600, 600, 1201, 600, 600, 500, 700],
        "width": [600, 1200, 600, 600, 1201, 600, 800, 700],
        "height": [600, 600, 1200, 600, 600, 1201, 800, 700],
    }
)
   length  width  height
0    1200    600     600
1     600   1200     600
2     600    600    1200
3    1201    600     600
4     600   1201     600
5     600    600    1201
6     500    800     800
7     700    700     700

I need to find the highest value per row, multiply the other two by 2, and add them all together.

So the formula is:

highest_value + (lower_value_1 * 2) + (lower_value_2 * 2)
  • E.g. for row 1: 1200 + (600 * 2) + (600 * 2) = 3600
  • E.g. for row 7: 800 + (800 * 2) + (500 * 2) = 3400
  • E.g. for row 8: 700 + (700 * 2) + (700 * 2) = 3500
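
As a quick sanity check of the arithmetic, here is a small plain-Python sketch (the girth helper is only for illustration, not part of the dataframe code):

def girth(a, b, c):
    # The largest value counts once, the two smaller values count twice.
    values = sorted([a, b, c], reverse=True)
    return values[0] + values[1] * 2 + values[2] * 2

print(girth(1200, 600, 600))  # 3600
print(girth(800, 800, 500))   # 3400
print(girth(700, 700, 700))   # 3500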

The output should be:

   length  width  height  girth
0    1200    600     600   3600
1     600   1200     600   3600
2     600    600    1200   3600
3    1201    600     600   3601
4     600   1201     600   3601
5     600    600    1201   3601
6     500    800     800   3400
7     700    700     700   3500

I can find the maximum for each row with

df[["length ", "width ", "height "]].max(axis=1)

but I need help with the rest.

CodePudding user response:

This seems to be a simpler solution

df['New'] = df.sum(axis=1)*2 - df.max(axis=1)

Edit: this is what I meant to post:

df['Girth'] = df.apply(lambda x: sum(x)*2 - max(x), axis = 1)
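
Both lines rely on the same identity: highest + 2 * (other_1 + other_2) equals 2 * (sum of the row) - highest, so no sorting is needed. Here is a minimal, self-contained sketch of the first (vectorized) version on the sample data; restricting the calculation to the three dimension columns is an assumption in case the frame carries other columns:

import pandas as pd

df = pd.DataFrame(
    {
        "length": [1200, 600, 600, 1201, 600, 600, 500, 700],
        "width": [600, 1200, 600, 600, 1201, 600, 800, 700],
        "height": [600, 600, 1200, 600, 600, 1201, 800, 700],
    }
)

# girth = 2 * (length + width + height) - max(length, width, height)
cols = ["length", "width", "height"]
df["girth"] = df[cols].sum(axis=1) * 2 - df[cols].max(axis=1)
print(df)

The .apply version gives the same result, but it evaluates the lambda once per row, so the column-wise version is usually much faster on large frames.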

CodePudding user response:

Here is a fast, vectorized solution that takes advantage of numpy broadcasting:

mask = pd.DataFrame(df.to_numpy() != df.max(axis=1).to_numpy().reshape(-1,1), columns=df.columns)
df['girth'] = df.add(df[mask].fillna(0)).sum(axis=1)

Output:

>>> df
   length  width  height   girth
0    1200    600     600  3600.0
1     600   1200     600  3600.0
2     600    600    1200  3600.0
3    1201    600     600  3601.0
4     600   1201     600  3601.0
5     600    600    1201  3601.0

On my 3.6 GHz 8-core Intel Core i9, this took about 950 ms for 6 million rows.
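
If you want to check the timing on your own machine, a rough benchmark sketch along these lines should do (the random data and the fixed row count are assumptions, not the data the figure above was measured on):

import time

import numpy as np
import pandas as pd

# Synthetic 6-million-row frame of random dimensions (assumed stand-in data).
rng = np.random.default_rng(0)
big = pd.DataFrame(
    rng.integers(100, 2000, size=(6_000_000, 3)),
    columns=["length", "width", "height"],
)

start = time.perf_counter()
mask = pd.DataFrame(
    big.to_numpy() != big.max(axis=1).to_numpy().reshape(-1, 1),
    columns=big.columns,
)
big["girth"] = big.add(big[mask].fillna(0)).sum(axis=1)
print(f"girth for {len(big):,} rows in {time.perf_counter() - start:.3f} s")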


Basically what we're doing here is:

  1. Creating a mask that selects all the values in the dataframe that are not the max of the row they're in
  2. Adding the values selected by that mask (the non-max values) to the dataframe again
  3. Summing the resulting rows
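
To make those three steps concrete, here is an illustrative sketch that prints the intermediate objects for the first few sample rows (the variable names are my own):

import pandas as pd

df = pd.DataFrame(
    {
        "length": [1200, 600, 600],
        "width": [600, 1200, 600],
        "height": [600, 600, 1200],
    }
)

# Step 1: boolean mask of the values that are not the maximum of their row.
row_max = df.max(axis=1).to_numpy().reshape(-1, 1)
mask = pd.DataFrame(df.to_numpy() != row_max, columns=df.columns)

# Step 2: keep only the non-max values; the max positions become 0.
non_max = df[mask].fillna(0)

# Step 3: add the non-max values to the original frame and sum each row.
girth = df.add(non_max).sum(axis=1)

print(mask)
print(non_max)
print(girth)  # 3600.0 for each of these rows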