I have a DataFrame such as:
COL1 start1 end1 start2 end2
A 5000 6000 5000 6500
B 5000 6000 4550 6000
C 5000 6000 2000 5300
D 5000 6000 5900 8000
E 5000 6000 5600 5800
F 5000 6000 5000 6000
G 5000 6000 4000 7000
I would like to add a new column NEWCOL, computed as follows:
if start1 == start2 & end1 == end2:
    NEWCOL = 1
elif start1 == start2 & end2 > start1 | end1 == end2 & start1 < start2:
    NEWCOL = (end2-start2) / (end1-start1)
elif start2 < start1 & end2 > end1:
    NEWCOL = (end2-start2) / (end1-start1)
elif start1 < start2 & end1 > end2:
    NEWCOL = (end2-start2) / (end1-start1)
elif start1 > start2 & end1 > end2:
    NEWCOL = (end2-start1) / (end1-start1)
elif start1 < start2 & end1 < end2:
    NEWCOL = (end1-start2) / (end1-start1)
I should then get:
COL1 start1 end1 start2 end2 NEWCOL
A 5000 6000 5000 6500 1.5
B 5000 6000 4550 6000 1.45
C 5000 6000 2000 5300 0.3
D 5000 6000 5900 8000 0.1
E 5000 6000 5600 5800 0.2
F 5000 6000 5000 6000 1
G 5000 6000 4000 7000 3
CodePudding user response:
A solution with a custom function is possible, but it is slow on larger DataFrames:
def f(x):
    start1 = x['start1']
    start2 = x['start2']
    end1 = x['end1']
    end2 = x['end2']
    if start1 == start2 and end1 == end2:
        return 1
    elif (start1 == start2 and end2 > start1) or (end1 == end2 and start1 < start2):
        return (end2-start2) / (end1-start1)
    elif start2 < start1 and end2 > end1:
        return (end2-start2) / (end1-start1)
    elif start1 < start2 and end1 > end2:
        return (end2-start2) / (end1-start1)
    elif start1 > start2 and end1 > end2:
        return (end2-start1) / (end1-start1)
    elif start1 < start2 and end1 < end2:
        return (end1-start2) / (end1-start1)
To improve performance, use numpy.select:
import numpy as np

# each mask m_i is paired with the value s_i used where that mask is True
m1 = (df.start1 == df.start2) & (df.end1 == df.end2)
s1 = 1
m2 = ((df.start1 == df.start2) & (df.end2 > df.start1)) | ((df.end1 == df.end2) & (df.start1 < df.start2))
s2 = (df.end2-df.start2) / (df.end1-df.start1)
m3 = (df.start2 < df.start1) & (df.end2 > df.end1)
s3 = (df.end2-df.start2) / (df.end1-df.start1)
m4 = (df.start1 < df.start2) & (df.end1 > df.end2)
s4 = (df.end2-df.start2) / (df.end1-df.start1)
m5 = (df.start1 > df.start2) & (df.end1 > df.end2)
s5 = (df.end2-df.start1) / (df.end1-df.start1)
m6 = (df.start1 < df.start2) & (df.end1 < df.end2)
s6 = (df.end1-df.start2) / (df.end1-df.start1)
masks = [m1, m2, m3, m4, m5, m6]
vals = [s1, s2, s3, s4, s5, s6]
# the first matching mask wins; rows matching no mask get the default NaN
df['VAL'] = np.select(masks, vals, default=np.nan)
df['val1'] = df.apply(f, axis=1)
print(df)
COL1 start1 end1 start2 end2 VAL val1
0 A 5000 6000 5000 6500 1.5 1.5
1 B 5000 6000 4550 6000 NaN NaN
2 C 5000 6000 2000 5300 0.3 0.3
3 D 5000 6000 5900 8000 0.1 0.1
4 E 5000 6000 5600 5800 0.2 0.2
5 F 5000 6000 5000 6000 1.0 1.0
6 G 5000 6000 4000 7000 3.0 3.0
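Row B comes out as NaN in both columns, although the question expects 1.45. A small diagnostic sketch, rebuilding the sample frame by hand, labels each row with the first mask it satisfies. The `matched` column and `m2_alt` are names introduced here for illustration, and the change from `<` to `>` in `m2_alt` is only a guess at what the question intended.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})
s1, e1, s2, e2 = df.start1, df.end1, df.start2, df.end2

# The six conditions, in the same order as m1..m6 above.
masks = [
    (s1 == s2) & (e1 == e2),
    ((s1 == s2) & (e2 > s1)) | ((e1 == e2) & (s1 < s2)),
    (s2 < s1) & (e2 > e1),
    (s1 < s2) & (e1 > e2),
    (s1 > s2) & (e1 > e2),
    (s1 < s2) & (e1 < e2),
]

# Label each row with the first mask that matches; "none" marks rows that
# fall through to the default (NaN) in np.select.
labels = ["m1", "m2", "m3", "m4", "m5", "m6"]
df["matched"] = np.select(masks, labels, default="none")
print(df[["COL1", "matched"]])

# Assumption (not stated in the question): the second half of m2 may have
# been meant as start1 > start2. Under that reading row B is covered.
m2_alt = ((s1 == s2) & (e2 > s1)) | ((e1 == e2) & (s1 > s2))
ratio = (e2 - s2) / (e1 - s1)
b_value = ratio[df.COL1 == "B"].iloc[0]
print(bool(m2_alt[1]), b_value)
```

Row B has end1 == end2 but start1 > start2, so none of the six conditions as written apply to it. If 1.45 is the wanted result, the conditions themselves need adjusting, not the np.select call.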
Performance:
#70k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [111]: %timeit df['VAL'] = np.select(masks, vals, default=np.nan)
1.79 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [112]: %timeit df['val1'] = df.apply(f, axis=1)
1.41 s ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
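As written, the first timing covers only the np.select call, presumably with masks and vals already recomputed on the enlarged frame. The sketch below is one way to compare the two approaches outside IPython with mask construction included, and to check that they agree. The helper `with_select`, the names `big`, `slow` and `fast`, and the smaller 7k-row frame are choices made here for illustration, not part of the answer.

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "COL1": list("ABCDEFG"),
    "start1": [5000] * 7,
    "end1": [6000] * 7,
    "start2": [5000, 4550, 2000, 5900, 5600, 5000, 4000],
    "end2": [6500, 6000, 5300, 8000, 5800, 6000, 7000],
})
# 7k rows (smaller than the 70k above) to keep the check quick.
big = pd.concat([df] * 1000, ignore_index=True)

def f(x):
    """Row-wise version, same branches as the function in the answer."""
    s1, e1, s2, e2 = x["start1"], x["end1"], x["start2"], x["end2"]
    if s1 == s2 and e1 == e2:
        return 1
    elif (s1 == s2 and e2 > s1) or (e1 == e2 and s1 < s2):
        return (e2 - s2) / (e1 - s1)
    elif s2 < s1 and e2 > e1:
        return (e2 - s2) / (e1 - s1)
    elif s1 < s2 and e1 > e2:
        return (e2 - s2) / (e1 - s1)
    elif s1 > s2 and e1 > e2:
        return (e2 - s1) / (e1 - s1)
    elif s1 < s2 and e1 < e2:
        return (e1 - s2) / (e1 - s1)

def with_select(d):
    """Vectorized version; masks and values are rebuilt from the frame passed in."""
    s1, e1, s2, e2 = d.start1, d.end1, d.start2, d.end2
    ratio = (e2 - s2) / (e1 - s1)
    masks = [
        (s1 == s2) & (e1 == e2),
        ((s1 == s2) & (e2 > s1)) | ((e1 == e2) & (s1 < s2)),
        (s2 < s1) & (e2 > e1),
        (s1 < s2) & (e1 > e2),
        (s1 > s2) & (e1 > e2),
        (s1 < s2) & (e1 < e2),
    ]
    vals = [1, ratio, ratio, ratio, (e2 - s1) / (e1 - s1), (e1 - s2) / (e1 - s1)]
    return pd.Series(np.select(masks, vals, default=np.nan), index=d.index)

t0 = time.perf_counter()
fast = with_select(big)
t1 = time.perf_counter()
slow = big.apply(f, axis=1).astype(float)  # None from fall-through becomes NaN
t2 = time.perf_counter()

print(f"np.select: {t1 - t0:.4f} s, apply: {t2 - t1:.4f} s")
print(np.allclose(fast, slow, equal_nan=True))
```

Absolute timings depend on the machine and library versions, so only the relative gap between the two is meaningful.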
CodePudding user response:
Another vectorized option is case_when from pyjanitor, which is similar to SQL's CASE WHEN, a chain of if/elif/else in Python, or np.select:
# pip install pyjanitor
import numpy as np
import pandas as pd
import janitor
# reusing @jezrael's already coded options :)
df.case_when(
    m1, s1,  # condition, value if True
    m2, s2,
    m3, s3,
    m4, s4,
    m5, s5,
    m6, s6,
    np.nan,  # default if False
    column_name='NEWCOL')
COL1 start1 end1 start2 end2 NEWCOL
0 A 5000 6000 5000 6500 1.5
1 B 5000 6000 4550 6000 NaN
2 C 5000 6000 2000 5300 0.3
3 D 5000 6000 5900 8000 0.1
4 E 5000 6000 5600 5800 0.2
5 F 5000 6000 5000 6000 1.0
6 G 5000 6000 4000 7000 3.0