How to remove df rows if the timestamp is inside at least one of given time intervals?

I'm given two (pandas) data frames:

  • data: with a timestamp column ts and other data columns val1 and val2.
  • intervals: with two columns, start and end, that describe a series of time intervals in which the data is corrupted.

My goal is to replace all row values in data with NaN if that row's ts falls within any range described by start and end. (That is, replace the i'th row in data with [NaN, ..., NaN] if there exists some row j in intervals such that ts_i is in [start_j, end_j].)

My current (naive) solution iterates over the intervals with a for loop:

import numpy as np
import pandas as pd
from tqdm import tqdm


def _remove_corruptions(corruption_filename: str, df_to_clean: pd.DataFrame):
    corruption_timestamps = pd.read_csv(corruption_filename)
    for j, row in tqdm(corruption_timestamps.iterrows()):
        start, end = row['start'], row['end']
        # blank out every row whose ts falls inside [start, end]
        df_to_clean[(start <= df_to_clean['ts']) & (df_to_clean['ts'] <= end)] = np.nan
    return df_to_clean
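
For completeness, a hypothetical call (the filename is made up; the CSV is assumed to hold the start and end columns):

cleaned = _remove_corruptions('corruptions.csv', data)  # 'corruptions.csv' is a hypothetical file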

Can I vectorize this process somehow?


An example is provided below:

intervals = pd.DataFrame({'start': ['2019-06-23', '2020-01-10'], 'end': ['2019-06-24', '2020-01-12']})

        start         end
0  2019-06-23  2019-06-24
1  2020-01-10  2020-01-12

data = pd.DataFrame({'ts': ['2019-06-23', '2019-10-24', '2020-01-11'], 'val1': [1, 1, 1], 'val2': [2, 2, 2]})

           ts  val1  val2
0  2019-06-23     1     2
1  2019-10-24     1     2
2  2020-01-11     1     2

Required output:

out = pd.DataFrame({'ts': [np.nan, '2019-10-24', np.nan], 'val1': [np.nan, 1, np.nan], 'val2': [np.nan, 2, np.nan]})

           ts  val1  val2
0         NaN   NaN   NaN
1  2019-10-24   1.0   2.0
2         NaN   NaN   NaN

CodePudding user response:

import pandas as pd
import numpy as np

intervals = pd.DataFrame({'start': ['2019-06-23', '2020-01-10'], 'end': ['2019-06-24', '2020-01-12']})
data = pd.DataFrame({'ts': ['2019-06-23', '2019-10-24', '2020-01-11'], 'val1': [1, 1, 1], 'val2': [2, 2, 2]})


def find_corrupted(x):
    # return the index values of all data rows whose ts falls between start and end
    return data[(data.ts >= x['start']) & (data.ts <= x['end'])].index.values

# store the matching index values for each interval in a new column
intervals['corrupt'] = intervals.apply(find_corrupted, axis=1)

# flatten into the unique matching index values; dropna() guards against
# intervals that match no rows (explode turns an empty match into NaN),
# and astype(int) keeps the lookup positional
mask = intervals.corrupt.explode().dropna().unique().astype(int)

data.loc[data.iloc[mask].index, :] = np.nan

Output:

          ts  val1  val2
0         NaN   NaN   NaN
1  2019-10-24   1.0   2.0
2         NaN   NaN   NaN
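
A fully vectorized variant of the same idea (a sketch, not part of the original answer) compares every timestamp against every interval at once via NumPy broadcasting, reusing the imports and frames defined above:

# convert to datetime so the comparisons are safe
ts = pd.to_datetime(data['ts']).to_numpy()
start = pd.to_datetime(intervals['start']).to_numpy()
end = pd.to_datetime(intervals['end']).to_numpy()

# shape (len(data), len(intervals)): True where ts_i falls inside interval j
inside = (ts[:, None] >= start) & (ts[:, None] <= end)

# collapse to one boolean per data row and blank out the matching rows
data.loc[inside.any(axis=1), :] = np.nan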

CodePudding user response:

You can use the following (note that keying a dict on start assumes no two intervals share the same start date):

interval_dict = dict(zip(intervals['start'], intervals['end']))  # {'2019-06-23': '2019-06-24', '2020-01-10': '2020-01-12'}
data['check'] = data['ts'].apply(lambda x: any(k <= x <= v for k, v in interval_dict.items()))  # True if ts falls inside any interval
data.loc[data['check'].eq(True), data.columns] = np.nan  # where check is True, replace the whole row with NaN

data = data.drop('check', axis=1)

Or as a one-liner:

data.loc[data['ts'].apply(lambda x: any(k <= x <= v for k, v in dict(zip(intervals['start'], intervals['end'])).items())).eq(True), data.columns] = np.nan

Output:

          ts  val1  val2
0        NaT   NaN   NaN
1 2019-10-24   1.0   2.0
2        NaT   NaN   NaN

Note: Make sure the date columns are of datetime type before you start:

intervals['start'] = pd.to_datetime(intervals['start'])
intervals['end'] = pd.to_datetime(intervals['end'])
data['ts'] = pd.to_datetime(data['ts'])
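
As a further sketch (not part of the original answer): once the columns are datetime, and assuming the intervals do not overlap, pd.IntervalIndex can flag the corrupted rows without any Python-level loop:

# intervals closed on both ends; get_indexer returns -1 for timestamps that
# fall inside no interval (it requires the intervals to be non-overlapping)
idx = pd.IntervalIndex.from_arrays(intervals['start'], intervals['end'], closed='both')
corrupted = idx.get_indexer(data['ts']) != -1
data.loc[corrupted, :] = np.nan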