Efficient way to apply function with multiple operations on dataframe row

Time:11-13

I have a pandas dataframe that looks like this:

            X[m]      Y[m]      Z[m]  ...      beta  newx  newy
0       1.439485  0.087100  0.029771  ...  0.063807  1439    87
1       1.439485  0.089729  0.029121  ...  0.065871  1439    89
2       1.439485  0.091992  0.030059  ...  0.067653  1439    91
3       1.439485  0.082073  0.030721  ...  0.059883  1439    82
4       1.439485  0.084095  0.028952  ...  0.061458  1439    84
5       1.439485  0.085937  0.028019  ...  0.062897  1439    85

Each dataframe has hundreds of thousands of such rows, and I have many dataframes like this. X and Y are coordinates on a plane (Z is not important) that has been rotated 45 degrees to the right about its middle point. I need to move every point back to its original place, i.e. -45 degrees from its current location. The columns newx and newy currently hold the coordinates before the change; I want to overwrite them with the new coordinates. Since I know the coordinates of the middle point, the point itself, the middle-to-point angle (alpha) and the middle-to-fixed-point angle (beta), I can use the approach presented on Mathematics Stack Exchange. I translated that code to Python like this:

for i in range(len(df)):
    if df.iloc[i].alpha == math.pi/2 or df.iloc[i].alpha == 3*math.pi/2:
        df.newx[i] = mid
        df.newy[i] = int(math.tan(df.iloc[i].beta*(df.iloc[i].x-mid)+mid))
    elif df.iloc[i].beta == math.pi/2 or df.iloc[i].beta == 3*math.pi/2:
        #df.newx[i] = df.iloc[i].x -- this is already set
        df.newy[i] = int(math.tan(df.iloc[i].alpha*(mid-df.iloc[i].x)+mid))
    else:
        m0 = math.tan(df.iloc[i].alpha)
        m1 = math.tan(df.iloc[i].beta)
        x = ((m0 * df.iloc[i].x - m1 * mid) - (df.iloc[i].y - mid)) / (m0 - m1)
        df.newx[i] = int(x)
        df.newy[i] = int(m0 * (x - df.iloc[i].x) + df.iloc[i].y)
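
Reading the else branch above, it appears to compute the intersection of two straight lines: one through the point (x, y) with slope m0 = tan(alpha), and one through (mid, mid) with slope m1 = tan(beta). This interpretation is inferred from the code and is not stated in the post. Under that reading, with (X, Y) as the intersection point:

```latex
\begin{aligned}
m_0\,(X - x) + y &= m_1\,(X - \mathrm{mid}) + \mathrm{mid} \\
X\,(m_0 - m_1) &= (m_0\,x - m_1\,\mathrm{mid}) - (y - \mathrm{mid}) \\
X &= \frac{(m_0\,x - m_1\,\mathrm{mid}) - (y - \mathrm{mid})}{m_0 - m_1},
\qquad Y = m_0\,(X - x) + y
\end{aligned}
```

The division requires m0 and m1 to differ (the lines must not be parallel). The two special branches handle angles of pi/2 and 3*pi/2, where the tangent is undefined.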

Although this does what I need and moves each point to the correct position, it is extremely slow, and I have too many files to process this way. I know there are much faster methods, such as vectorization, apply, and list comprehensions, but I can't figure out how to use them with this function.

Here are the first 10 rows as a dictionary:

{'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e 00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e 00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e 00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e 00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e 00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e 00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e 00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e 00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e 00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e 00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: -0.7158861861473951}, 
'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}, 'newx': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'newy': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}}

CodePudding user response:

Same approach as @Joshua Voskamp, but I still wanted to share

import pandas as pd
import numpy as np
import math

df = pd.DataFrame({'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e 00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e 00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e 00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e 00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e 00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e 00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e 00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e 00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e 00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e 00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: 
-0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}, 'newx': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'newy': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}})

mid = 0 #not sure what mid value should be

near_threshold = 0.001

alpha_near_half_pi = df.alpha.sub(math.pi/2).abs().le(near_threshold)
alpha_near_three_half_pi = df.alpha.sub(3*math.pi/2).abs().le(near_threshold)
beta_near_half_pi = df.beta.sub(math.pi/2).abs().le(near_threshold)
beta_near_three_half_pi = df.beta.sub(3*math.pi/2).abs().le(near_threshold)

cond1 = alpha_near_half_pi | alpha_near_three_half_pi
cond2 = beta_near_half_pi | beta_near_three_half_pi
cond2 = cond2 & (~cond1) #if cond1 is true, we don't want to do cond2
cond3 = ~(cond1 | cond2) #if neither cond1 nor cond2, then we are in cond3

#Process cond1 rows
c1 = df.loc[cond1]
df.loc[cond1,'newx'] = mid
df.loc[cond1,'newy'] = np.tan(c1.beta*(c1.x-mid)+mid)

#Process cond2 rows
c2 = df.loc[cond2]
df.loc[cond2,'newy'] = np.tan(c2.alpha*(mid-c2.x)+mid)

#Process cond3 rows
c3 = df.loc[cond3]
m0 = np.tan(c3.alpha)
m1 = np.tan(c3.beta)

#                       Is this a mistake? always 0?
#                                   |
#                             --------------
x = ((m0 * c3.x - m1 * mid) - (c3.y - c3.y)) / (m0 - m1)
df.loc[cond3,'newx'] = x.astype(int)
df.loc[cond3,'newy'] = (m0 * (x - c3.x) + c3.y).astype(int)

df
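
The `.sub(...).abs().le(near_threshold)` masks above are a tolerance comparison on floats. As a minimal sketch, the same test can probably be written with `np.isclose`; the toy angle values below are made up for illustration and are not from the question's data.

```python
import numpy as np
import pandas as pd

# Toy angles (hypothetical): two near pi/2, one ordinary, one at 3*pi/2
angles = pd.Series([np.pi / 2, np.pi / 2 + 5e-4, 1.0, 3 * np.pi / 2])
near_threshold = 0.001

# With rtol=0, np.isclose tests |angle - target| <= atol,
# which matches angles.sub(target).abs().le(near_threshold)
near_half_pi = np.isclose(angles, np.pi / 2, rtol=0, atol=near_threshold)
near_three_half_pi = np.isclose(angles, 3 * np.pi / 2, rtol=0, atol=near_threshold)

# Boolean mask of rows whose angle points along the vertical axis
is_vertical = near_half_pi | near_three_half_pi
```

Setting `rtol=0` matters here: by default `np.isclose` also adds a relative tolerance, which would make the window slightly wider than `near_threshold`.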

CodePudding user response:

TL;DR this was a mess; then I borrowed some ideas from @mitoRibo. Go vote up on their answer. Both of us used a strategy of "selectively calculate newx/newy using masking, where the mask is equivalent to the if/elif/else condition provided".

#setup
import pandas as pd
import numpy as np
import math

df = pd.DataFrame({'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e 00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e 00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e 00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e 00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e 00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e 00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e 00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e 00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e 00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e 00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: 
-0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}})

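# coordinate of the middle point; the question does not give its value,
# so 0 is only a placeholder (assumption)
mid = 0

# make the new columns
df['newx'] = np.nan
df['newy'] = np.nan
# if any of the values are np.nan when we're done, something went wrong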

# Do the float `between` comparison but cleverly
EPSILON = 1e-6
# sides must run low-to-high: Series.between(left, right) checks left <= x <= right
(sides, directions) = ((-1, 1), (1, 3))
windows = tuple(tuple(d*math.pi/2 + s*EPSILON for s in sides) for d in directions)
# challenge: make this more DRY (don't repeat yourself)
alpha_vertical = sum([df.alpha.between(*w) for w in windows]).astype(bool)
beta_vertical  = sum([ df.beta.between(*w) for w in windows]).astype(bool)\
                 & ~alpha_vertical
neither = (~alpha_vertical & ~beta_vertical)

# Handle `alpha_is_in_y_axis`:
c1 = df.loc[alpha_vertical]
df.loc[alpha_vertical,'newx'] = mid
df.loc[alpha_vertical,'newy'] = np.tan(c1.beta*(c1.x - mid) + mid).astype(int)

# Handle `beta_is_in_y_axis`:
c2 = df.loc[beta_vertical]
# ignore the x-values
df.loc[beta_vertical,'newy'] = np.tan(c2.alpha*(mid - c2.x) + mid).astype(int)

# Handle the other cases:
c3 = df.loc[neither]
m0 = np.tan(c3.alpha)
m1 = np.tan(c3.beta)
t = ((m0 * c3.x - m1 * mid) - (c3.y - mid)) / (m0 - m1)

df.loc[neither,'newx'] = t.astype(int)
df.loc[neither,'newy'] = (m0 * (t - c3.x) + c3.y).astype(int)
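
The masking strategy above can also be packed into one function with nested `np.where` calls, which is convenient when the same transformation has to run over many dataframes. The following is only a sketch under assumptions: the name `restore_points`, the `atol` parameter, and the value `mid=720` are hypothetical, the three branches are copied from the question's loop as written, and rows where tan(alpha) equals tan(beta) are assumed not to occur.

```python
import numpy as np
import pandas as pd

def restore_points(df, mid, atol=1e-6):
    """Vectorized sketch of the question's if/elif/else loop (hypothetical helper)."""
    x = df["x"].to_numpy(dtype=float)
    y = df["y"].to_numpy(dtype=float)
    alpha = df["alpha"].to_numpy(dtype=float)
    beta = df["beta"].to_numpy(dtype=float)

    def vertical(angle):
        # True where the angle is within atol of pi/2 or 3*pi/2
        return (np.isclose(angle, np.pi / 2, rtol=0, atol=atol)
                | np.isclose(angle, 3 * np.pi / 2, rtol=0, atol=atol))

    alpha_vertical = vertical(alpha)                 # the "if" branch
    beta_vertical = vertical(beta) & ~alpha_vertical  # the "elif" branch

    # General branch, computed for every row and discarded where a mask applies
    m0 = np.tan(alpha)
    m1 = np.tan(beta)
    t = ((m0 * x - m1 * mid) - (y - mid)) / (m0 - m1)
    general_y = m0 * (t - x) + y

    # Special branches, copied from the loop as written in the question
    y_alpha = np.tan(beta * (x - mid) + mid)
    y_beta = np.tan(alpha * (mid - x) + mid)

    newx = np.where(alpha_vertical, mid, np.where(beta_vertical, x, t))
    newy = np.where(alpha_vertical, y_alpha, np.where(beta_vertical, y_beta, general_y))

    out = df.copy()
    out["newx"] = newx.astype(int)  # truncates toward zero, like int()
    out["newy"] = newy.astype(int)
    return out

# Usage with the first two rows of the question's data; mid=720 is an assumed value
sample = pd.DataFrame({
    "x": [1439, 1439],
    "y": [87, 89],
    "alpha": [-0.7215912027987663, -0.719527331916007],
    "beta": [0.06380696059868196, 0.06587083148144124],
})
result = restore_points(sample, mid=720)
```

Unlike the `df.loc[mask, ...]` version, this computes all three branches for every row and then selects between them. That does some redundant arithmetic, but it avoids index alignment and keeps the function free of side effects on the input dataframe.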