I have these two datasets:
val_1 = [81.322, 81.342, 81.452, 81.552, 82.522, 82.562, 82.722, 81.723, 81.322, 81.332 ]
val_2 = [0.234, 0.231, 0.333, 0.324, 0.666, 0.645, 0.334, 0.345, 0.878, 0.888]
data_1 = pd.DataFrame({'val_1': val_1, 'val_2': val_2})
data_1
val_1 val_2
0 81.322 0.234
1 81.342 0.231
2 81.452 0.333
3 81.552 0.324
4 82.522 0.666
5 82.562 0.645
6 82.722 0.334
7 81.723 0.345
8 81.322 0.878
9 81.332 0.888
and
val_1_start = [81.100,81.41, 81.99, 81.320 ]
val_2_start = [0.230,0.331,0.32, 0.875 ]
val_1_end = [81.400,81.554,82.8, 81.333 ]
val_2_end = [0.281,0.335,0.68, 0.890]
value = [24,21,34,11]
data_2 = pd.DataFrame({'val_1_start': val_1_start, 'val_2_start': val_2_start,
'val_1_end': val_1_end, 'val_2_end':val_2_end, 'value': value})
data_2
val_1_start val_2_start val_1_end val_2_end value
0 81.10 0.230 81.400 0.281 24
1 81.41 0.331 81.554 0.335 21
2 81.99 0.320 82.800 0.680 34
3 81.32 0.875 81.333 0.890 11
I am trying to find where val_1 is between val_1_start and val_1_end and val_2 is between val_2_start and val_2_end. Then I would add the value to the first dataset.
I wouldn't mind either output here
val_1 val_2 value
81.322 0.234 24
81.342 0.231 24
81.452 0.333 21
81.552 0.324 NA
82.522 0.666 34
82.562 0.645 34
82.722 0.334 34
81.723 0.345 NA
81.322 0.878 11
81.332 0.888 11
val_1 val_2 value
81.322 0.234 24
81.342 0.231 24
81.452 0.333 21
82.522 0.666 34
82.562 0.645 34
82.722 0.334 34
81.322 0.878 11
81.332 0.888 11
Please let me know how I can do this
CodePudding user response:
Use cross join in DataFrame.merge
and then filter by boolean indexing
:
df = data_1.merge(data_2, how='cross')
m = (df['val_1'].between(df['val_1_start'], df['val_1_end']) &
df['val_2'].between(df['val_2_start'], df['val_2_end']))
df = df.loc[m, ['val_1','val_2','value']]
print (df)
val_1 val_2 value
0 81.322 0.234 24
4 81.342 0.231 24
9 81.452 0.333 21
18 82.522 0.666 34
22 82.562 0.645 34
26 82.722 0.334 34
35 81.322 0.878 11
39 81.332 0.888 11
CodePudding user response:
One option is the conditional_join from pyjanitor; it tries to avoid a cartesian join for non-equi joins ( to reduce memory consumption):
# pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
(data_1.conditional_join(
data_2,
# column from left, column from right, join type
('val_1', 'val_1_start', '>='),
('val_1', 'val_1_end', '<='),
('val_2', 'val_2_start', '>='),
('val_2', 'val_2_end', '<=')
)
.select_columns(data_1.columns, 'value')
)
val_1 val_2 value
0 81.322 0.234 24
1 81.342 0.231 24
2 81.452 0.333 21
3 82.522 0.666 34
4 82.562 0.645 34
5 82.722 0.334 34
6 81.322 0.878 11
7 81.332 0.888 11
for a left join, pass left
to the how
parameter:
(data_1.conditional_join(
data_2,
('val_1', 'val_1_start', '>='),
('val_1', 'val_1_end', '<='),
('val_2', 'val_2_start', '>='),
('val_2', 'val_2_end', '<='),
how = 'left'
)
.select_columns(data_1.columns, 'value')
)
val_1 val_2 value
0 81.322 0.234 24.0
1 81.342 0.231 24.0
2 81.452 0.333 21.0
3 81.552 0.324 NaN
4 82.522 0.666 34.0
5 82.562 0.645 34.0
6 82.722 0.334 34.0
7 81.723 0.345 NaN
8 81.322 0.878 11.0
9 81.332 0.888 11.0
Note that at the moment, you have to install the pyjanitor dev version to use this function. pip install git https://github.com/pyjanitor-devs/pyjanitor.git
This page also has options that you may find useful/relevant.