Making rows NaN based on many conditions

If I have a dataframe with some index and some value as follows:

import pandas as pd
from random import random

my_index = []
my_vals = []
for i in range(1000):
    my_index.append(i)
    my_vals.append(random())

df_vals = pd.DataFrame({'my_index': my_index, 'my_vals': my_vals})

And I have a second dataframe with columns start and end. Each row should be read as a closed interval, so the first row is the interval from 1 to 4 (including both 1 and 4). It is the following dataframe:

df_intervals = pd.DataFrame({'start': [1, 7, 54, 73, 136, 235, 645, 785, 968], 'end': [4, 34, 65, 90, 200, 510, 700, 805, 988]})

I would like to make all values in the my_vals column of df_vals NaN if the row's index (my_index) does not fall into one of the intervals specified in the df_intervals dataframe. What is the best way to go about this automatically, rather than specifying each condition manually?

(In my actual data set there are more than 9 intervals, this is some example data)

EDIT: in my actual data these indices are not strictly integers; they can also be arbitrary floats

CodePudding user response:

I believe this is a possible solution:

def index_in_range(index, df):
    for index_, row in df.iterrows():
        if (index >= row['start']) and (index <= row['end']):
            return True
    
    return False
    
df_vals['my_vals'] = df_vals.apply(lambda row: row['my_vals'] if index_in_range(row['my_index'], df_intervals) else None, axis=1)
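A note on this approach: `apply` plus `iterrows` checks every interval for every row, which is O(n·m). As a possibly faster alternative (a sketch, not from the original answer; the toy data below is hypothetical), pandas' `pd.IntervalIndex` can locate each index in a set of non-overlapping closed intervals in one vectorized call, and it works for float indices as well:

```python
import pandas as pd

# Hypothetical toy data with float indices (not the question's data)
df_vals = pd.DataFrame({'my_index': [0.0, 1.5, 4.0, 5.0, 7.2],
                        'my_vals': [0.1, 0.2, 0.3, 0.4, 0.5]})
df_intervals = pd.DataFrame({'start': [1, 7], 'end': [4, 34]})

# Build closed intervals [start, end]; get_indexer returns -1 for values
# that fall inside no interval (requires non-overlapping intervals).
iv = pd.IntervalIndex.from_arrays(df_intervals['start'],
                                  df_intervals['end'],
                                  closed='both')
in_range = iv.get_indexer(df_vals['my_index']) != -1

# Keep my_vals where the index is in range, NaN elsewhere
df_vals['my_vals'] = df_vals['my_vals'].where(in_range)
```

Here 0.0 and 5.0 fall outside both intervals and become NaN, while 1.5, 4.0, and 7.2 are kept.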

To accomplish this without using a lambda function, you can do the following,

def index_in_range(index, df):
    for index_, row in df.iterrows():
        if (index >= row['start']) and (index <= row['end']):
            return True
    
    return False

for index_, row in df_vals.iterrows():
    if not index_in_range(row['my_index'], df_intervals):
        df_vals.at[index_, 'my_vals'] = None

Output:

    my_index    my_vals
0   0   NaN
1   1   0.126647
2   2   0.769215
3   3   0.819891
4   4   0.674466
... ... ...
995 995 NaN
996 996 NaN
997 997 NaN
998 998 NaN
999 999 NaN

CodePudding user response:

One way would be to create an array that would include all the values of your intervals, for example when start = 1 and end = 4, the array would be [1,2,3,4]. Similarly when start = 7 and end = 34, the array would be [7,8,9,10 ... , 34].

intervals_exp = df_intervals.apply(lambda row: [n for n in range(row['start'], row['end'] + 1)], axis=1).explode().values
intervals_exp

array([1, 2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
       22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 54, 55, 56, 57, 58,
       59, 60, 61, 62, 63, 64, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
       ..., 986, 987], dtype=object)

Then you can use isin to check which values of 'my_index' are in the above array, combined with np.where:

import numpy as np

df_vals['my_vals refined'] = np.where(df_vals['my_index'].isin(intervals_exp), df_vals['my_index'], np.nan)

which prints:

     my_index   my_vals  my_vals refined
0           0  0.564172              NaN
1           1  0.852806              1.0
2           2  0.643407              2.0
3           3  0.719642              3.0
4           4  0.233949              4.0
..        ...       ...              ...
995       995  0.355014              NaN
996       996  0.842957              NaN
997       997  0.143479              NaN
998       998  0.915176              NaN
999       999  0.147195              NaN

[1000 rows x 3 columns]