I'm given two (pandas) data frames:
data: with a timestamp column ts and other data columns val1 and val2.
intervals: with two columns, start and end, that describe a series of time intervals in which data is corrupted.
My goal is to replace all row values in data with NaN if that row's ts is within some range described by start, end. (Namely, replace the i'th row in data with [NaN, ..., NaN] if there exists some row j in intervals such that ts_i is in [start_j, end_j].)
My current (naive) solution iterates the ranges using a for loop:
import numpy as np
import pandas as pd
from tqdm import tqdm

def _remove_corruptions(corruption_filename: str, df_to_clean: pd.DataFrame):
    corruption_timestamps = pd.read_csv(corruption_filename)
    for j, row in tqdm(corruption_timestamps.iterrows()):
        start, end = row['start'], row['end']
        # blank out every column of the rows whose ts falls inside [start, end]
        df_to_clean[(start <= df_to_clean['ts']) & (df_to_clean['ts'] <= end)] = np.nan
    return df_to_clean
Can I vectorize this process somehow?
An example is provided below:
intervals = pd.DataFrame({'start': ['2019-06-23', '2020-01-10'], 'end': ['2019-06-24', '2020-01-12']})
start end
0 2019-06-23 2019-06-24
1 2020-01-10 2020-01-12
data = pd.DataFrame({'ts': ['2019-06-23', '2019-10-24', '2020-01-11'], 'val1': [1, 1, 1], 'val2': [2, 2, 2]})
ts val1 val2
0 2019-06-23 1 2
1 2019-10-24 1 2
2 2020-01-11 1 2
Required output:
out = pd.DataFrame({'ts': [np.nan, '2019-10-24', np.nan], 'val1': [np.nan, 1, np.nan], 'val2': [np.nan, 2, np.nan]})
ts val1 val2
0 NaN NaN NaN
1 2019-10-24 1.0 2.0
2 NaN NaN NaN
CodePudding user response:
import pandas as pd
import numpy as np

intervals = pd.DataFrame({'start': ['2019-06-23', '2020-01-10'], 'end': ['2019-06-24', '2020-01-12']})
data = pd.DataFrame({'ts': ['2019-06-23', '2019-10-24', '2020-01-11'], 'val1': [1, 1, 1], 'val2': [2, 2, 2]})

def find_corrupted(x):
    # returns the index labels of all data rows whose ts lies between start and end
    # (plain string comparison works here because the dates are ISO-formatted)
    return data[(data.ts >= x['start']) & (data.ts <= x['end'])].index.values

# store the matching index labels of each interval in a new column
intervals['corrupt'] = intervals.apply(find_corrupted, axis=1)

# flatten to one array of unique corrupted labels; dropna() guards against
# intervals that match no rows, astype(int) makes the labels usable with .loc
mask = intervals.corrupt.explode().dropna().unique().astype(int)
data.loc[mask, :] = np.nan
Output:
ts val1 val2
0 NaN NaN NaN
1 2019-10-24 1.0 2.0
2 NaN NaN NaN
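For larger frames, the apply step can be avoided entirely. A minimal sketch of one such variant, assuming the corruption intervals do not overlap and that ts, start and end have already been converted to datetime with pd.to_datetime:
import numpy as np
import pandas as pd

# an IntervalIndex over the corrupted ranges, closed on both ends
idx = pd.IntervalIndex.from_arrays(intervals['start'], intervals['end'], closed='both')

# get_indexer returns -1 for every timestamp that falls inside no interval
corrupted = idx.get_indexer(data['ts']) != -1
data.loc[corrupted, :] = np.nan
Note that get_indexer raises an error when the intervals overlap; in that case the loop or apply based approaches are the safer choice.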
CodePudding user response:
You can use:
interval_dict = dict(zip(intervals['start'],intervals['end'])) # {'2019-06-23': '2019-06-24', '2020-01-10': '2020-01-12'}
data['check'] = data['ts'].apply(lambda x: any(x >= k and x <= v for k,v in interval_dict.items())) # check ts value. If ts value in range return True
data.loc[data['check'].eq(True), data.columns] = np.nan # if check is True replace rows with nan
data = data.drop('check',axis=1)
Or in one line:
data.loc[data['ts'].apply(lambda x: any(x >= k and x <= v for k,v in dict(zip(intervals['start'],intervals['end'])).items())).eq(True), data.columns] = np.nan
Output:
ts val1 val2
0 NaT NaN NaN
1 2019-10-24 1.0 2.0
2 NaT NaN NaN
Note: Make sure the date columns are of type datetime before you start.
intervals['start'] = pd.to_datetime(intervals['start'])
intervals['end'] = pd.to_datetime(intervals['end'])
data['ts'] = pd.to_datetime(data['ts'])
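Putting the conversion and the check together, a small helper roughly along these lines (a sketch; the name remove_corrupted is only illustrative) keeps the calling code tidy:
import numpy as np
import pandas as pd

def remove_corrupted(data: pd.DataFrame, intervals: pd.DataFrame) -> pd.DataFrame:
    # work on a copy so the caller's frame is left untouched
    data = data.copy()
    data['ts'] = pd.to_datetime(data['ts'])
    starts = pd.to_datetime(intervals['start'])
    ends = pd.to_datetime(intervals['end'])

    # True for every ts that falls inside at least one [start, end] interval
    corrupted = data['ts'].apply(lambda t: ((starts <= t) & (t <= ends)).any())
    data.loc[corrupted, data.columns] = np.nan
    return data

out = remove_corrupted(data, intervals)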