If I have a dataframe with some index and some value as follows:
import pandas as pd
from random import random
my_index = []
my_vals = []
for i in range(1000):
    my_index.append(i + random())
    my_vals.append(random())
df_vals = pd.DataFrame({'my_index': my_index, 'my_vals': my_vals})
And I have a second dataframe with a column start and a column end; each row must be read as an interval, so the first row would be the interval from 1 to 4 (including 1 and 4). It is the following dataframe:
df_intervals = pd.DataFrame({'start': [1, 7, 54, 73, 136, 235, 645, 785, 968], 'end': [4, 34, 65, 90, 200, 510, 700, 805, 988]})
I would like to make all values in the my_vals column of df_vals a NaN if the row's index (my_index) does not fall into one of the intervals specified in the df_intervals dataframe. What is the best way to go about this automatically, rather than specifying each condition manually?
(In my actual data set there are more than 9 intervals; this is just some example data.)
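To make it concrete, the manual version I am trying to avoid would look something like this (only the first two intervals shown; just a sketch of the pattern, not my actual code):
import numpy as np

# Spell out each interval condition by hand -- this does not scale to many intervals:
in_first = (df_vals['my_index'] >= 1) & (df_vals['my_index'] <= 4)
in_second = (df_vals['my_index'] >= 7) & (df_vals['my_index'] <= 34)
df_vals.loc[~(in_first | in_second), 'my_vals'] = np.nan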
EDIT: in my actual data these indices are not strictly integers; they can also be random floats.
CodePudding user response:
I believe this is a possible solution:
def index_in_range(index, df):
    # return True if `index` falls inside any [start, end] interval of df
    for index_, row in df.iterrows():
        if (index >= row['start']) and (index <= row['end']):
            return True
    return False

df_vals['my_vals'] = df_vals.apply(lambda row: row['my_vals'] if index_in_range(row['my_index'], df_intervals) else None, axis=1)
To accomplish this without using a lambda function, you can do the following:
def index_in_range(index, df):
    # return True if `index` falls inside any [start, end] interval of df
    for index_, row in df.iterrows():
        if (index >= row['start']) and (index <= row['end']):
            return True
    return False

# set my_vals to None for rows whose index is not covered by any interval
for index_, row in df_vals.iterrows():
    if not index_in_range(row['my_index'], df_intervals):
        df_vals.at[index_, 'my_vals'] = None
Output:
my_index my_vals
0 0 NaN
1 1 0.126647
2 2 0.769215
3 3 0.819891
4 4 0.674466
... ... ...
995 995 NaN
996 996 NaN
997 997 NaN
998 998 NaN
999 999 NaN
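The nested iterrows loop checks every interval for every row one pair at a time, which can get slow when df_vals is large. A rough vectorized sketch of the same containment check using numpy broadcasting (untested against your exact data, and it assumes the my_vals column should simply be overwritten in place):
import numpy as np

# Compare every my_index value against every interval at once.
idx = df_vals['my_index'].to_numpy()[:, None]        # shape (n_rows, 1)
starts = df_intervals['start'].to_numpy()[None, :]   # shape (1, n_intervals)
ends = df_intervals['end'].to_numpy()[None, :]

# True where the index falls inside at least one [start, end] interval.
in_any = ((idx >= starts) & (idx <= ends)).any(axis=1)

df_vals.loc[~in_any, 'my_vals'] = np.nan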
CodePudding user response:
One way would be to create an array that includes all the values of your intervals. For example, when start = 1 and end = 4 the array would be [1, 2, 3, 4]; similarly, when start = 7 and end = 34 the array would be [7, 8, 9, 10, ..., 34].
intervals_exp = df_intervals.apply(lambda row: [n for n in range(row['start'], row['end'] + 1)], axis=1).explode().values
intervals_exp
array([1, 2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, ..., 986, 987], dtype=object)
Then you can use isin to check which values of my_index are in the above array, combined with np.where:
import numpy as np

# keep my_index where it falls inside an interval, NaN otherwise
df_vals['my_vals refined'] = np.where(df_vals['my_index'].isin(intervals_exp), df_vals['my_index'], np.nan)
which prints:
my_index my_vals my_vals refined
0 0 0.564172 NaN
1 1 0.852806 1.0
2 2 0.643407 2.0
3 3 0.719642 3.0
4 4 0.233949 4.0
.. ... ... ...
995 995 0.355014 NaN
996 996 0.842957 NaN
997 997 0.143479 NaN
998 998 0.915176 NaN
999 999 0.147195 NaN
[1000 rows x 3 columns]
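One caveat: isin against an expanded range only matches integer indices. If my_index can be a random float (as mentioned in the edit to the question), a sketch based on pandas' IntervalIndex keeps the inclusive containment check without having to enumerate every value in each interval (this assumes a pandas version that supports closed='both'):
import numpy as np
import pandas as pd

# Intervals that include both endpoints, matching "from 1 to 4 (including 1 and 4)".
intervals = pd.IntervalIndex.from_arrays(df_intervals['start'], df_intervals['end'], closed='both')

# get_indexer returns the position of the containing interval, or -1 if no interval contains the value.
positions = intervals.get_indexer(df_vals['my_index'])

# Keep my_vals where an interval was found, NaN elsewhere.
df_vals['my_vals'] = df_vals['my_vals'].where(positions != -1, np.nan)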