Home > Net >  Check if value in pandas dataframe is within any two values of two other columns in another datafram
Check if value in pandas dataframe is within any two values of two other columns in another datafram

Time:11-05

I have two dataframes of different length. dfSamples (63012375 rows) and dfFixations (200000 rows).

dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})  
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})

I would like to check each value in dfSamples if it is within any two ranges given in dfFixations and then assign a label to this value. I have found this: Check if value in a dataframe is between two values in another dataframe, but the loop solution is terribly slow and I cannot make any other solution work.

Working (but very slow) example:

labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
for i, fixation in dfFix.iterrows():
    log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
    labels[log_range] = 'fixation'
labels[labels != 'fixation'] = 'no_fixation'
dfSamples['labels'] = labels

Following this example: Performance of Pandas apply vs np.vectorize to create new column from existing columns I have tried to vectorize this but with no success.

def check_range(samples, tstart, tend):
    log_range = (samples > tstart) & (samples < tend)
    return log_range
fixations = list(map(check_range, dfSamples['tSample'], dfFix['tStart'], dfFix['tEnd']))

Would appreciate any help!

CodePudding user response:

Use IntervalIndex.from_arrays with IntervalIndex.get_indexer , if not match is returned -1, so checked and set ouput in numpy.where:

i = pd.IntervalIndex.from_arrays(dfFixations['tStart'],
                                 dfFixations['tEnd'], 
                                 closed="both")
pos = i.get_indexer(dfSamples['tSample'])
dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")

print (dfSamples)
   tSample       labels
0        4     fixation
1        6     fixation
2        8     fixation
3       10  no_fixation
4       12     fixation
5       14     fixation

Performance: In ideal nice sorted not overlap data, in real should be performance different, the best test it.

dfSamples = pd.DataFrame({'tSample':range(10000)})  
dfFixations = pd.DataFrame({'tStart':range(0, 10000, 5),'tEnd':range(2, 10000, 5)})
    


In [165]: %%timeit
     ...: labels = np.empty_like(dfSamples['tSample']).astype(np.chararray)
     ...: for i, fixation in dfFixations.iterrows():
     ...:     log_range = dfSamples['tSample'].between(fixation['tStart'], fixation['tEnd'])
     ...:     labels[log_range] = 'fixation'
     ...: labels[labels != 'fixation'] = 'no_fixation'
     ...: dfSamples['labels'] = labels
     ...: 
     ...: 
1.25 s ± 52.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [168]: %%timeit
     ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: dfSamples["labels1"] =  np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")
     ...: 
315 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [170]: %%timeit
     ...: ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)
     ...: dfSamples["labels1"] = np.where(contained, "fixation", "no_fixation")
     ...: 
82.4 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [166]: %%timeit
     ...: s = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
     ...: pos = s.get_indexer(dfSamples['tSample'])
     ...: dfSamples['labels'] = np.where(pos != -1, "fixation", "no_fixation")
     ...: 
27.8 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response:

setup

dfSamples = pd.DataFrame({'tSample':[4, 6, 8, 10, 12, 14]})  
dfFixations = pd.DataFrame({'tStart':[4,12],'tEnd':[8,14]})

solution

Create an interval index from your start and end points

ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")

ii.contains is a method which checks if a point is contained by each interval in the interval index, eg

dfSamples["tSample"].apply(ii.contains)

gives

0     [True, False]
1     [True, False]
2     [True, False]
3    [False, False]
4     [False, True]
5     [False, True]
Name: tSample, dtype: object

We're going to take this result, apply any function to each element (a list) to get a pandas.Series of booleans, which we can then use with numpy.where

dfSamples["labels"] =  np.where(dfSamples["tSample"].apply(ii.contains).apply(any), "fixation", "no_fixation")

the result

   tSample       labels
0        4     fixation
1        6     fixation
2        8  no_fixation
3       10  no_fixation
4       12     fixation
5       14  no_fixation

edit: faster version

Using piso v0.6.0

import piso
import numpy as np

ii = pd.IntervalIndex.from_arrays(dfFixations['tStart'], dfFixations['tEnd'], closed="both")
contained = np.logical_or.reduce(piso.contains(ii, dfSamples["tSample"], include_index=False), axis=0)
dfSamples["labels"] = np.where(contained, "fixation", "no_fixation")

This will run in a similar time to @jezrael's solution, however it can handle cases where intervals overlaps eg

dfFixations = pd.DataFrame({'tStart':[4,5,12],'tEnd':[8,9,14]})
  • Related