How can I filter out rows where the difference is less than a threshold?

The problem and context:

I am working with electronic equipment that records events measured using a reed switch. It records the time at which the reed switch opens (recorded as 0) and closes (recorded as 1). This equipment sometimes records false measurements at timescales we know are not realistic. The 'default state' of the device is normally closed.

What I am trying to do:

Filter out changes in state where the time difference is less than 60 seconds. Often there are several false openings and closings following each other. Additionally, the resulting dataframe must have events ordered as 1, 0, 1, 0, 1, etc.; the device can only open, then close, then open, and so on.

What I have tried:

I have got some code working, but it's long and messy. I am sure there must be a better way of doing this. Any suggestions would be great.

Example original dataframe

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'date': ['2019-05-10 12:20:31', '2019-05-10 12:25:43', '2019-05-10 12:25:57',
              '2019-05-10 12:28:00', '2019-05-10 12:30:01', '2019-05-10 12:35:00',
              '2019-05-10 12:35:30', '2019-05-10 12:35:45', '2019-05-10 12:37:00',
              '2019-05-10 12:40:00', '2019-05-10 12:45:00'],
     'event': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]},
    index=range(11))

df['date'] = pd.to_datetime(df['date'], errors='raise', dayfirst=True)
df['event'] = pd.to_numeric(df['event'])
df
Out[21]: 
                  date  event
0  2019-05-10 12:20:31      1
1  2019-05-10 12:25:43      0
2  2019-05-10 12:25:57      1
3  2019-05-10 12:28:00      0
4  2019-05-10 12:30:01      1
5  2019-05-10 12:35:00      0
6  2019-05-10 12:35:30      1
7  2019-05-10 12:35:45      0
8  2019-05-10 12:37:00      1
9  2019-05-10 12:40:00      0
10 2019-05-10 12:45:00      1

Functions created


def filter_event_filter_dups(input_data, input_variable_name):
    """
    This is an 'inside' function for the main filter_event function and iteratively removes duplicates

    """
    df_tmp = input_data.copy()
    df_tmp = df_tmp.dropna()
    df_tmp = df_tmp.reset_index()
    df_tmp_final = pd.DataFrame(columns=['date', input_variable_name], index=range(len(df_tmp)))
    for index, row in df_tmp.iterrows():
        if index == 0:
            print('')
            df_tmp_final['date'].loc[index] = df_tmp['date'].loc[index]
            df_tmp_final[input_variable_name].loc[index] = df_tmp[input_variable_name].loc[index]
        else:
            if index + 1 == len(df_tmp):
                print('---------------------------------------------------------------')
                print('Index = ', index)
                print('End!')
            else:
                print('---------------------------------------------------------------')
                print('Index = ', index)
                
                if df_tmp[input_variable_name].loc[index] == df_tmp[input_variable_name].loc[index-1]:
                    # keep only first
                    # delete second
                    # df_tmp = df_tmp.drop([df_tmp.loc[index + 1]])
                    df_tmp_final['date'].loc[index-1] = df_tmp['date'].loc[index-1]
                    df_tmp_final[input_variable_name].loc[index-1] = df_tmp[input_variable_name].loc[index-1]
                elif df_tmp[input_variable_name].loc[index] != df_tmp[input_variable_name].loc[index-1]:
                    df_tmp_final['date'].loc[index] = df_tmp['date'].loc[index]
                    df_tmp_final[input_variable_name].loc[index] = df_tmp[input_variable_name].loc[index]
    return(df_tmp_final)


def filter_event(input_data, input_variable_name, input_delta_time_threshold_seconds):
    """
    This function is designed to filter out events which are very close together.
    """
    df_tmp = input_data.reset_index()
    df_tmp[input_variable_name] = pd.to_numeric(df_tmp[input_variable_name])
    del df_tmp['index']
    # output_df = pd.DataFrame()
    output_df = pd.DataFrame(columns=['date', input_variable_name], index=range(len(df_tmp)))
    for index, row in df_tmp.iterrows():
        if index + 1 == len(df_tmp):
            print('---------------------------------------------------------------')
            print('Index = ', index)
            print('End!')
        else:
            # if df_tmp[input_variable_name].loc[index] == 0:
            print('---------------------------------------------------------------')
            print('Index = ', index)
            if ((df_tmp['date'].iloc[index + 1] - df_tmp['date'].iloc[index]).total_seconds()) < input_delta_time_threshold_seconds:
                # If less than threshold
                print('This is LESS than threshold - Threshold = ', input_delta_time_threshold_seconds, 'seconds')
                print('Time difference to next 0 = ', df_tmp['date'].loc[index + 1] - df_tmp['date'].loc[index])
                print('Ignore')
            if ((df_tmp['date'].iloc[index + 1] - df_tmp['date'].iloc[index]).total_seconds()) > input_delta_time_threshold_seconds:
                # If more than threshold
                print('This is MORE than threshold - Threshold = ', input_delta_time_threshold_seconds, 'seconds') 
                print('Time difference to next 0 = ', df_tmp['date'].loc[index + 1] - df_tmp['date'].loc[index])
                output_df['date'].loc[index] = df_tmp['date'].loc[index]
                output_df[input_variable_name].loc[index] = df_tmp[input_variable_name].loc[index]
                # output_df['date'].loc[index + 1] = df_tmp['date'].loc[index + 1]
                # output_df[input_variable_name].loc[index + 1] = df_tmp[input_variable_name].loc[index + 1]
    output_df = output_df.dropna()
    output_df = output_df.reset_index()
    
    output_df[input_variable_name] = pd.to_numeric(output_df[input_variable_name])
    #
    #
    #
    output_df_final = filter_event_filter_dups(input_data = output_df, input_variable_name = input_variable_name)
    
    # for i in range(2):
    #     output_df_final = filter_event_filter_dups(input_data = output_df_final, input_variable_name = input_variable_name)
    
    output_df_final = output_df_final.dropna()
    output_df_final[input_variable_name] = pd.to_numeric(output_df_final[input_variable_name])
    # output_df_final['date'] = pd.to_datetime(output_df_final, errors='raise', dayfirst=True)
    output_df_final['date'] = output_df_final['date'].apply(pd.to_datetime)


    return(output_df_final)
    

test = filter_event(input_data = df
                    , input_variable_name = 'event'
                    , input_delta_time_threshold_seconds = 60
                    )

What the final DF should look like

final_df
Out[20]: 
                 date  event
0 2019-05-10 12:20:31      1
1 2019-05-10 12:28:00      0
2 2019-05-10 12:30:01      1
3 2019-05-10 12:40:00      0
4 2019-05-10 12:45:00      1

CodePudding user response:

You could use a mask:

mask = df['date'].diff().lt('60s')
df[~(mask|mask.shift(-1))]
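
Here df['date'].diff() is the gap to the previous row and .lt('60s') flags gaps shorter than a minute; OR-ing the mask with mask.shift(-1) also flags the row just before each short gap, so both sides of a spurious open/close pair are dropped.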

and if you want to ensure that the events alternate:

mask = df['date'].diff().lt('60s')
df[~(mask|mask.shift(-1))][lambda d: d['event'].ne(d['event'].shift())]

output:

                  date  event
0  2019-05-10 12:20:31      1
3  2019-05-10 12:28:00      0
4  2019-05-10 12:30:01      1
9  2019-05-10 12:40:00      0
10 2019-05-10 12:45:00      1
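
Indexing with a lambda applies the second condition to the already time-filtered frame, and d['event'].ne(d['event'].shift()) keeps only the rows where the state actually changes, which guarantees the 1, 0, 1, 0 ordering.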

CodePudding user response:

With .diff() we can get the difference between consecutive timestamps and drop rows that are less than 60 seconds away from their previous or next row:

>>> delay = df['date'].diff().dt.total_seconds()
>>> df_min60s = df[delay.fillna(np.inf).ge(60) & delay.shift(-1).fillna(np.inf).ge(60)]
>>> df_min60s
                  date  event
0  2019-05-10 12:20:31      1
3  2019-05-10 12:28:00      0
4  2019-05-10 12:30:01      1
8  2019-05-10 12:37:00      1
9  2019-05-10 12:40:00      0
10 2019-05-10 12:45:00      1
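
The fillna(np.inf) matters because .diff() is NaT for the first row and delay.shift(-1) is NaN for the last row; treating those gaps as infinite keeps the first and last events instead of silently dropping them.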

Then we can remove repeated events:

>>> df_final = df_min60s[df_min60s['event'].ne(df_min60s['event'].shift())]
>>> df_final
                  date  event
0  2019-05-10 12:20:31      1
3  2019-05-10 12:28:00      0
4  2019-05-10 12:30:01      1
9  2019-05-10 12:40:00      0
10 2019-05-10 12:45:00      1
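
If you need this in more than one place, both steps can be wrapped in a small helper. This is just a sketch of the approach above (the name drop_false_events and the threshold_s parameter are made up for illustration):

import numpy as np
import pandas as pd

def drop_false_events(df, threshold_s=60):
    # Gap in seconds to the previous row (NaN for the first row).
    gap = df['date'].diff().dt.total_seconds()
    # Keep rows at least threshold_s away from both neighbours;
    # NaN gaps at the edges are treated as infinitely large.
    keep = gap.fillna(np.inf).ge(threshold_s) & gap.shift(-1).fillna(np.inf).ge(threshold_s)
    kept = df[keep]
    # Drop consecutive duplicates so the events alternate 1, 0, 1, 0, ...
    return kept[kept['event'].ne(kept['event'].shift())]

filtered = drop_false_events(df, threshold_s=60)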