If I have a dataframe with some index and some value as follows:
import pandas as pd
from random import random
my_index = []
my_vals = []
for i in range(1000):
    my_index.append(i + random())
    my_vals.append(random())
df_vals = pd.DataFrame({'my_index': my_index, 'my_vals': my_vals})
And I have a second dataframe with a column start and a column end; each row must be read as an interval, so the first row would be the interval from 1 to 4 (including 1 and 4). It is the following dataframe:
df_intervals = pd.DataFrame({'start': [1, 7, 54, 73, 136, 235, 645, 785, 968], 'end': [4, 34, 65, 90, 200, 510, 700, 805, 988]})
I would like to make all values in the my_vals column of df_vals a NaN if the row's index (my_index) does not fall into one of the intervals specified in the df_intervals dataframe. What is the best way to go about this automatically, rather than specifying each condition manually?
(In my actual data set there are more than 9 intervals; this is just some example data.)
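To make it concrete, the manual version I am trying to avoid would look something like this (only the first two intervals shown; just a sketch of the pattern, not my actual code):
import numpy as np

# Spell out each interval condition by hand -- this does not scale to many intervals:
in_first = (df_vals['my_index'] >= 1) & (df_vals['my_index'] <= 4)
in_second = (df_vals['my_index'] >= 7) & (df_vals['my_index'] <= 34)
df_vals.loc[~(in_first | in_second), 'my_vals'] = np.nan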
EDIT: in my actual data these indices are not strictly integers; they can also be random floats.
CodePudding user response:
I believe this is a possible solution:
def index_in_range(index, df):
    # return True if `index` falls inside any [start, end] interval of df
    for index_, row in df.iterrows():
        if (index >= row['start']) and (index <= row['end']):
            return True
    return False

df_vals['my_vals'] = df_vals.apply(lambda row: row['my_vals'] if index_in_range(row['my_index'], df_intervals) else None, axis=1)
To accomplish this without using a lambda function, you can do the following:
def index_in_range(index, df):
    # return True if `index` falls inside any [start, end] interval of df
    for index_, row in df.iterrows():
        if (index >= row['start']) and (index <= row['end']):
            return True
    return False

# set my_vals to None for rows whose index is not covered by any interval
for index_, row in df_vals.iterrows():
    if not index_in_range(row['my_index'], df_intervals):
        df_vals.at[index_, 'my_vals'] = None
Output:
my_index my_vals
0 0 NaN
1 1 0.126647
2 2 0.769215
3 3 0.819891
4 4 0.674466
... ... ...
995 995 NaN
996 996 NaN
997 997 NaN
998 998 NaN
999 999 NaN
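The nested iterrows loop checks every interval for every row one pair at a time, which can get slow when df_vals is large. A rough vectorized sketch of the same containment check using numpy broadcasting (untested against your exact data, and it assumes the my_vals column should simply be overwritten in place):
import numpy as np

# Compare every my_index value against every interval at once.
idx = df_vals['my_index'].to_numpy()[:, None]        # shape (n_rows, 1)
starts = df_intervals['start'].to_numpy()[None, :]   # shape (1, n_intervals)
ends = df_intervals['end'].to_numpy()[None, :]

# True where the index falls inside at least one [start, end] interval.
in_any = ((idx >= starts) & (idx <= ends)).any(axis=1)

df_vals.loc[~in_any, 'my_vals'] = np.nan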
CodePudding user response:
One way would be to create an array that includes all the values of your intervals. For example, when start = 1 and end = 4 the array would be [1, 2, 3, 4]; similarly, when start = 7 and end = 34 the array would be [7, 8, 9, 10, ..., 34].
intervals_exp = df_intervals.apply(lambda row: [n for n in range(row['start'], row['end'] + 1)], axis=1).explode().values
intervals_exp
array([1, 2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, ..., 986, 987], dtype=object)
Then you can use isin to check which values of my_index are in the above array, combined with np.where:
import numpy as np

# keep my_index where it falls inside an interval, NaN otherwise
df_vals['my_vals refined'] = np.where(df_vals['my_index'].isin(intervals_exp), df_vals['my_index'], np.nan)
which prints:
my_index my_vals my_vals refined
0 0 0.564172 NaN
1 1 0.852806 1.0
2 2 0.643407 2.0
3 3 0.719642 3.0
4 4 0.233949 4.0
.. ... ... ...
995 995 0.355014 NaN
996 996 0.842957 NaN
997 997 0.143479 NaN
998 998 0.915176 NaN
999 999 0.147195 NaN
[1000 rows x 3 columns]
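One caveat: isin against an expanded range only matches integer indices. If my_index can be a random float (as mentioned in the edit to the question), a sketch based on pandas' IntervalIndex keeps the inclusive containment check without having to enumerate every value in each interval (this assumes a pandas version that supports closed='both'):
import numpy as np
import pandas as pd

# Intervals that include both endpoints, matching "from 1 to 4 (including 1 and 4)".
intervals = pd.IntervalIndex.from_arrays(df_intervals['start'], df_intervals['end'], closed='both')

# get_indexer returns the position of the containing interval, or -1 if no interval contains the value.
positions = intervals.get_indexer(df_vals['my_index'])

# Keep my_vals where an interval was found, NaN elsewhere.
df_vals['my_vals'] = df_vals['my_vals'].where(positions != -1, np.nan)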