How to remove df rows if the timestamp is inside at least one of given time intervals?

I'm given two (pandas) data frames:

  • data: with a timestamp column ts and other data columns val1 and val2.
  • intervals: with two columns, start and end, that describe a series of time intervals in which the data is corrupted.

My goal is to replace all row values in data with NaN if that row's ts falls within any range described by start and end. (That is, replace the i'th row in data with [NaN, ..., NaN] if there exists some row j in intervals such that ts_i is in [start_j, end_j].)

My current (naive) solution iterates over the intervals with a for loop:

import numpy as np
import pandas as pd
from tqdm import tqdm


def _remove_corruptions(corruption_filename: str, df_to_clean: pd.DataFrame):
    corruption_timestamps = pd.read_csv(corruption_filename)
    for j, row in tqdm(corruption_timestamps.iterrows()):
        start, end = row['start'], row['end']
        # blank out every row whose ts falls inside [start, end]
        df_to_clean[(start <= df_to_clean['ts']) & (df_to_clean['ts'] <= end)] = np.nan
    return df_to_clean
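
For completeness, a hypothetical call (the filename is made up; the CSV is assumed to hold the start and end columns):

cleaned = _remove_corruptions('corruptions.csv', data)  # 'corruptions.csv' is a hypothetical file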

Can I vectorize this process somehow?


An example is provided below:

intervals = pd.DataFrame({'start': ['2019-06-23', '2020-01-10'], 'end': ['2019-06-24', '2020-01-12']})

        start         end
0  2019-06-23  2019-06-24
1  2020-01-10  2020-01-12

data = pd.DataFrame({'ts': ['2019-06-23', '2019-10-24', '2020-01-11'], 'val1': [1, 1, 1], 'val2': [2, 2, 2]})

           ts  val1  val2
0  2019-06-23     1     2
1  2019-10-24     1     2
2  2020-01-11     1     2

Required output:

out = pd.DataFrame({'ts': [np.nan, '2019-10-24', np.nan], 'val1': [np.nan, 1, np.nan], 'val2': [np.nan, 2, np.nan]})

           ts  val1  val2
0         NaN   NaN   NaN
1  2019-10-24   1.0   2.0
2         NaN   NaN   NaN

CodePudding user response:

import pandas as pd
import numpy as np

intervals = pd.DataFrame({'start': ['2019-06-23', '2020-01-10'], 'end': ['2019-06-24', '2020-01-12']})
data = pd.DataFrame({'ts': ['2019-06-23', '2019-10-24', '2020-01-11'], 'val1': [1, 1, 1], 'val2': [2, 2, 2]})


def find_corrupted(x):
    # return the index values of all data rows whose ts falls between start and end
    return data[(data.ts >= x['start']) & (data.ts <= x['end'])].index.values

# store the matching index values for each interval in a new column
intervals['corrupt'] = intervals.apply(find_corrupted, axis=1)

# flatten into the unique matching index values; dropna() guards against
# intervals that match no rows (explode turns an empty match into NaN),
# and astype(int) keeps the lookup positional
mask = intervals.corrupt.explode().dropna().unique().astype(int)

data.loc[data.iloc[mask].index, :] = np.nan

Output:

          ts  val1  val2
0         NaN   NaN   NaN
1  2019-10-24   1.0   2.0
2         NaN   NaN   NaN
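
A fully vectorized variant of the same idea (a sketch, not part of the original answer) compares every timestamp against every interval at once via NumPy broadcasting, reusing the imports and frames defined above:

# convert to datetime so the comparisons are safe
ts = pd.to_datetime(data['ts']).to_numpy()
start = pd.to_datetime(intervals['start']).to_numpy()
end = pd.to_datetime(intervals['end']).to_numpy()

# shape (len(data), len(intervals)): True where ts_i falls inside interval j
inside = (ts[:, None] >= start) & (ts[:, None] <= end)

# collapse to one boolean per data row and blank out the matching rows
data.loc[inside.any(axis=1), :] = np.nan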

CodePudding user response:

You can use the following (note that keying a dict on start assumes no two intervals share the same start date):

interval_dict = dict(zip(intervals['start'], intervals['end']))  # {'2019-06-23': '2019-06-24', '2020-01-10': '2020-01-12'}
data['check'] = data['ts'].apply(lambda x: any(k <= x <= v for k, v in interval_dict.items()))  # True if ts falls inside any interval
data.loc[data['check'].eq(True), data.columns] = np.nan  # where check is True, replace the whole row with NaN

data = data.drop('check', axis=1)

Or as a one-liner:

data.loc[data['ts'].apply(lambda x: any(k <= x <= v for k, v in dict(zip(intervals['start'], intervals['end'])).items())).eq(True), data.columns] = np.nan

Output:

          ts  val1  val2
0        NaT   NaN   NaN
1 2019-10-24   1.0   2.0
2        NaT   NaN   NaN

Note: Make sure the date columns are of datetime type before you start:

intervals['start'] = pd.to_datetime(intervals['start'])
intervals['end'] = pd.to_datetime(intervals['end'])
data['ts'] = pd.to_datetime(data['ts'])
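
As a further sketch (not part of the original answer): once the columns are datetime, and assuming the intervals do not overlap, pd.IntervalIndex can flag the corrupted rows without any Python-level loop:

# intervals closed on both ends; get_indexer returns -1 for timestamps that
# fall inside no interval (it requires the intervals to be non-overlapping)
idx = pd.IntervalIndex.from_arrays(intervals['start'], intervals['end'], closed='both')
corrupted = idx.get_indexer(data['ts']) != -1
data.loc[corrupted, :] = np.nan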