Find where values are between a start and end value and add column python-CodePudding

I have these two datasets:


val_1 = [81.322, 81.342, 81.452, 81.552, 82.522, 82.562, 82.722, 81.723, 81.322, 81.332 ]
val_2 = [0.234, 0.231, 0.333, 0.324, 0.666, 0.645, 0.334, 0.345, 0.878, 0.888]
data_1 = pd.DataFrame({'val_1': val_1, 'val_2': val_2})
data_1

    val_1   val_2
0   81.322  0.234
1   81.342  0.231
2   81.452  0.333
3   81.552  0.324
4   82.522  0.666
5   82.562  0.645
6   82.722  0.334
7   81.723  0.345
8   81.322  0.878
9   81.332  0.888

and


val_1_start = [81.100,81.41, 81.99, 81.320 ]
val_2_start = [0.230,0.331,0.32, 0.875 ]
val_1_end = [81.400,81.554,82.8, 81.333 ]
val_2_end = [0.281,0.335,0.68,  0.890]
value = [24,21,34,11]
data_2 = pd.DataFrame({'val_1_start': val_1_start, 'val_2_start': val_2_start, 
                       'val_1_end': val_1_end, 'val_2_end':val_2_end, 'value': value})
data_2


   val_1_start  val_2_start  val_1_end  val_2_end   value
0   81.10       0.230        81.400     0.281       24
1   81.41       0.331        81.554     0.335       21
2   81.99       0.320        82.800     0.680       34
3   81.32      0.875         81.333     0.890       11

I am trying to find where val_1 is between val_1_start and val_1_end and val_2 is between val_2_start and val_2_end. Then I would add the value to the first dataset.

I wouldn't mind either output here

val_1   val_2    value
81.322  0.234  24
81.342  0.231  24
81.452  0.333  21
81.552  0.324  NA
82.522  0.666  34
82.562  0.645  34
82.722  0.334  34
81.723  0.345  NA
81.322  0.878  11
81.332  0.888  11

val_1   val_2    value
81.322  0.234  24
81.342  0.231  24
81.452  0.333  21
82.522  0.666  34
82.562  0.645  34
82.722  0.334  34
81.322  0.878  11
81.332  0.888  11

Please let me know how I can do this

CodePudding user response：

Use cross join in DataFrame.merge and then filter by boolean indexing:

df = data_1.merge(data_2, how='cross')

m = (df['val_1'].between(df['val_1_start'], df['val_1_end']) &
     df['val_2'].between(df['val_2_start'], df['val_2_end']))
df = df.loc[m, ['val_1','val_2','value']]
print (df)
     val_1  val_2  value
0   81.322  0.234     24
4   81.342  0.231     24
9   81.452  0.333     21
18  82.522  0.666     34
22  82.562  0.645     34
26  82.722  0.334     34
35  81.322  0.878     11
39  81.332  0.888     11

CodePudding user response：

One option is the conditional_join from pyjanitor; it tries to avoid a cartesian join for non-equi joins ( to reduce memory consumption):

# pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

(data_1.conditional_join(
          data_2,
          # column from left, column from right, join type
          ('val_1', 'val_1_start', '>='),
          ('val_1', 'val_1_end', '<='),
          ('val_2', 'val_2_start', '>='),
          ('val_2', 'val_2_end', '<=')
          )
      .select_columns(data_1.columns, 'value')
)

    val_1  val_2  value
0  81.322  0.234     24
1  81.342  0.231     24
2  81.452  0.333     21
3  82.522  0.666     34
4  82.562  0.645     34
5  82.722  0.334     34
6  81.322  0.878     11
7  81.332  0.888     11

for a left join, pass left to the how parameter:

(data_1.conditional_join(
          data_2,
          ('val_1', 'val_1_start', '>='),
          ('val_1', 'val_1_end', '<='),
          ('val_2', 'val_2_start', '>='),
          ('val_2', 'val_2_end', '<='),
          how = 'left'
          )
      .select_columns(data_1.columns, 'value')
)

    val_1  val_2  value
0  81.322  0.234   24.0
1  81.342  0.231   24.0
2  81.452  0.333   21.0
3  81.552  0.324    NaN
4  82.522  0.666   34.0
5  82.562  0.645   34.0
6  82.722  0.334   34.0
7  81.723  0.345    NaN
8  81.322  0.878   11.0
9  81.332  0.888   11.0

Note that at the moment, you have to install the pyjanitor dev version to use this function. pip install git https://github.com/pyjanitor-devs/pyjanitor.git

This page also has options that you may find useful/relevant.