Filter records dataframe with date and NaN values

Time:09-28

I have a dataframe like this:

  index    col1                     col2                   col3
    0      2022-09-25 21:00:00    2022-09-25 20:00:00       NaN
    1      2022-09-25 21:26:00        NaN                   NaN
    2      2022-09-25 21:00:00    2022-09-25 22:00:00     2022-09-25 23:00:00
    3      2022-09-25 21:00:00    2022-09-25 22:00:00     2022-09-25 20:00:00
    4      2022-09-25 21:00:00    2022-09-25 23:00:00       NaN

I want to keep only the rows for which col1 < col2 and col2 < col3. A comparison against NaN (e.g. date < NaN) should always evaluate to true.

My desired output is

  index    col1                     col2                   col3
    1      2022-09-25 21:26:00        NaN                   NaN
    2      2022-09-25 21:00:00    2022-09-25 22:00:00     2022-09-25 23:00:00
    4      2022-09-25 21:00:00    2022-09-25 23:00:00       NaN

Index 0 is removed because col1 > col2, and index 3 is removed because col2 > col3.

CodePudding user response:

By default, ordering comparisons (<, <=, >, >=) with NaT always give False.
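A minimal sketch of that behaviour, using made-up timestamps (the variable names here are illustrative and not part of the question):

```python
import pandas as pd

ts = pd.Timestamp("2022-09-25 21:00:00")

# Ordering comparisons against NaT evaluate to False in both directions
lt_nat = ts < pd.NaT
ge_nat = ts >= pd.NaT

# Because `>=` is also False, negating it turns a NaT comparison into True
inverted = not (ts >= pd.NaT)

# The same holds element-wise on Series
s1 = pd.Series(pd.to_datetime(["2022-09-25 21:00:00", "2022-09-25 21:26:00"]))
s2 = pd.Series(pd.to_datetime(["2022-09-25 20:00:00", None]))

plain_mask = s1.lt(s2)      # the NaT row stays False
inverted_mask = ~s1.ge(s2)  # the NaT row becomes True
```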

To get True, you can use the reverse condition and invert it:

import pandas as pd

# ensure datetime
df[['col1', 'col2', 'col3']] = df[['col1', 'col2', 'col3']].apply(pd.to_datetime)

# NOT (col1 >= col2): True when col1 < col2 or when either value is NaT
m1 = ~df['col1'].ge(df['col2'])
# NOT (col2 >= col3): True when col2 < col3 or when either value is NaT
m2 = ~df['col2'].ge(df['col3'])

out = df[m1 & m2]

Output:

   index                col1                col2                col3
1      1 2022-09-25 21:26:00                 NaT                 NaT
2      2 2022-09-25 21:00:00 2022-09-25 22:00:00 2022-09-25 23:00:00
4      4 2022-09-25 21:00:00 2022-09-25 23:00:00                 NaT
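For anyone who wants to reproduce this end to end, here is a self-contained sketch that rebuilds the sample frame from the question and spells the rule out explicitly instead of inverting `ge`. The helper name `lt_or_missing` is made up for this example; it is assumed to be equivalent to the inverted mask above, since both treat any comparison involving NaT as passing.

```python
import pandas as pd

# Sample data from the question; None becomes NaT after conversion
df = pd.DataFrame({
    "col1": ["2022-09-25 21:00:00", "2022-09-25 21:26:00", "2022-09-25 21:00:00",
             "2022-09-25 21:00:00", "2022-09-25 21:00:00"],
    "col2": ["2022-09-25 20:00:00", None, "2022-09-25 22:00:00",
             "2022-09-25 22:00:00", "2022-09-25 23:00:00"],
    "col3": [None, None, "2022-09-25 23:00:00", "2022-09-25 20:00:00", None],
})
cols = ["col1", "col2", "col3"]
df[cols] = df[cols].apply(pd.to_datetime)


def lt_or_missing(left, right):
    """Hypothetical helper: True where left < right or either side is NaT."""
    return left.lt(right) | left.isna() | right.isna()


mask = lt_or_missing(df["col1"], df["col2"]) & lt_or_missing(df["col2"], df["col3"])
out = df[mask]

# Row labels that survive the filter
kept = out.index.tolist()

# The inverted-comparison mask from the answer, for comparison
inverted = ~df["col1"].ge(df["col2"]) & ~df["col2"].ge(df["col3"])
```

The explicit version is longer, but it makes the "missing values pass" rule visible in the code, which may be easier to adjust if only one side should be allowed to be missing.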

CodePudding user response:

My own approach is as follows:

import pandas as pd

# seconds from col1 to col2; NaN when either value is missing
df['diff_1'] = pd.to_datetime(df['col2']) - pd.to_datetime(df['col1'])
df['diff_1'] = df['diff_1'].dt.total_seconds()
m1 = (df['diff_1'] > 0) | (df['diff_1'].isnull())

# seconds from col2 to col3; NaN when either value is missing
df['diff_2'] = pd.to_datetime(df['col3']) - pd.to_datetime(df['col2'])
df['diff_2'] = df['diff_2'].dt.total_seconds()
m2 = (df['diff_2'] > 0) | (df['diff_2'].isnull())

out = df[m1 & m2]