I have a large dataframe df:
import pandas as pd

df = pd.DataFrame(
    {
        "length": [1200, 600, 600, 1201, 600, 600, 500, 700],
        "width": [600, 1200, 600, 600, 1201, 600, 800, 700],
        "height": [600, 600, 1200, 600, 600, 1201, 800, 700],
    }
)
length | width | height |
---|---|---|
1200 | 600 | 600 |
600 | 1200 | 600 |
600 | 600 | 1200 |
1201 | 600 | 600 |
600 | 1201 | 600 |
600 | 600 | 1201 |
500 | 800 | 800 |
700 | 700 | 700 |
I need to find the highest value per row, multiply each of the other two values by 2, and add everything together.
So the formula is:
highest_value + (lower_value_1 * 2) + (lower_value_2 * 2)
- E.g. for row 1:
1200 + (600 * 2) + (600 * 2) = 3600
- E.g. for row 7:
800 + (800 * 2) + (500 * 2) = 3400
- E.g. for row 8:
700 + (700 * 2) + (700 * 2) = 3500
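For reference, the formula can be checked by hand for a single row with a small helper. This is only a sketch; `girth_of` is a hypothetical name used for illustration, not part of pandas.

```python
def girth_of(length, width, height):
    """Longest side once, plus twice each of the two remaining sides."""
    # Sort so the largest value comes last. Only one value is taken as
    # the maximum, so a tied value still counts as a "lower" value.
    low_1, low_2, highest = sorted((length, width, height))
    return highest + low_1 * 2 + low_2 * 2


print(girth_of(1200, 600, 600))  # row 1 -> 3600
print(girth_of(500, 800, 800))   # row 7 -> 3400
print(girth_of(700, 700, 700))   # row 8 -> 3500
```

This is handy as a reference to compare any vectorized result against, though calling it row by row would be slow on a large dataframe.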
The output should be:
length | width | height | girth |
---|---|---|---|
1200 | 600 | 600 | 3600 |
600 | 1200 | 600 | 3600 |
600 | 600 | 1200 | 3600 |
1201 | 600 | 600 | 3601 |
600 | 1201 | 600 | 3601 |
600 | 600 | 1201 | 3601 |
500 | 800 | 800 | 3400 |
700 | 700 | 700 | 3500 |
I can find the maximum for each row with
df[["length", "width", "height"]].max(axis=1)
but I need help with the rest.
CodePudding user response:
This seems to be a simpler solution:
df['New'] = df.sum(axis=1)*2 - df.max(axis=1)
Edit: I meant to copy this row-wise version instead:
df['Girth'] = df.apply(lambda x: sum(x)*2 - max(x), axis = 1)
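Why this works: the highest value plus twice the rest is `max + 2 * (sum - max)`, which simplifies to `2 * sum - max`. Below is a minimal sketch on the sample data from the question. It restricts the calculation to the three dimension columns, on the assumption that the dataframe may already hold other columns (such as an earlier result); `dims` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "length": [1200, 600, 600, 1201, 600, 600, 500, 700],
        "width": [600, 1200, 600, 600, 1201, 600, 800, 700],
        "height": [600, 600, 1200, 600, 600, 1201, 800, 700],
    }
)

dims = ["length", "width", "height"]  # illustrative name for the input columns

# max + 2 * (sum - max) simplifies to 2 * sum - max
df["girth"] = df[dims].sum(axis=1) * 2 - df[dims].max(axis=1)

print(df["girth"].tolist())
# [3600, 3600, 3600, 3601, 3601, 3601, 3400, 3500]
```

Because only the sum and the max are used, rows with tied maximum values (rows 7 and 8) come out as the question expects.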
CodePudding user response:
Here is a fast, vectorized solution that takes advantage of NumPy broadcasting:
mask = pd.DataFrame(df.to_numpy() != df.max(axis=1).to_numpy().reshape(-1,1), columns=df.columns)
df['girth'] = df.add(df[mask].fillna(0)).sum(axis=1)
Output (first six rows):
>>> df
length width height girth
0 1200 600 600 3600.0
1 600 1200 600 3600.0
2 600 600 1200 3600.0
3 1201 600 600 3601.0
4 600 1201 600 3601.0
5 600 600 1201 3601.0
On my 3.6 GHz 8-core Intel Core i9, this took about 950 ms for 6,000,000 rows.
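Basically what we're doing here is:
- Creating a mask that selects every value in the dataframe that is not equal to the max of its row
- Adding the values selected by that mask (the non-max values) to the dataframe again, which doubles them
- Summing each resulting row
Note that values tied for a row's max are all excluded by the mask, so none of them is doubled; this works as intended only when each row has a single largest value.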
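The same steps can be sketched directly on the underlying NumPy array, next to a variant that exempts exactly one maximum per row so that tied values are still doubled. This is an illustration on the question's sample data, not a benchmarked replacement; `girth_mask` and `girth_argmax` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "length": [1200, 600, 600, 1201, 600, 600, 500, 700],
        "width": [600, 1200, 600, 600, 1201, 600, 800, 700],
        "height": [600, 600, 1200, 600, 600, 1201, 800, 700],
    }
)

values = df.to_numpy()
# Column vector of row maxima; it broadcasts across the three columns.
row_max = values.max(axis=1).reshape(-1, 1)

# Mask approach: add every value that differs from its row max a second time.
not_max = values != row_max
girth_mask = (values + np.where(not_max, values, 0)).sum(axis=1)

# Variant: double everything, then restore a single max per row (the first one).
rows = np.arange(len(values))
first_max = values.argmax(axis=1)
doubled = values * 2
doubled[rows, first_max] = values[rows, first_max]
girth_argmax = doubled.sum(axis=1)

print(girth_mask.tolist())
print(girth_argmax.tolist())
```

The two results agree for the first six rows. For the rows with tied maxima (500/800/800 and 700/700/700), the mask approach gives 2600 and 2100, while the `argmax` variant gives the 3400 and 3500 the question asks for.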