Home > OS >  Select rows based on multiple columns that are later than a certain date
Select rows based on multiple columns that are later than a certain date

Time:08-29

I have the following dataframe:

import pandas as pd
import numpy as np

np.random.seed(0)
# create an array of 5 dates starting at '2015-02-24', one per minute
rng = pd.date_range('2021-07-29', periods=5, freq='D')
rng_1  = pd.date_range('2021-07-30', periods=5, freq='D')
rng_2 = pd.date_range('2021-07-31', periods=5, freq='D')
df_status = ['received', 'send', 'received', 'send', 'send']
df = pd.DataFrame({ 'Date': rng, 'Date_1': rng_1, 'Date_2': rng_2, 'status': df_status }) 
print(df)

I would like to print out all the rows if at least one column contains a date that is equal to, or at least 2021-08-01. What would be the most effective way to do this?

I have tried to do this with the following code, however, I get the following error:

start_date = '2022-08-01'
start_date = pd.to_datetime(start_date, format="%Y/%m/%d")
mask = (df['Date'] >= start_date | df['Date_1'] >= start_date | df['Date_3'] >= start_date)

TypeError: unsupported operand type(s) for &: 'Timestamp' and 'DatetimeArray'

Thank you in advance.

CodePudding user response:

Use boolean indexing with any:

df[df.ge('2021-08-01').any(1)]

output:

        Date     Date_1     Date_2
1 2021-07-30 2021-07-31 2021-08-01
2 2021-07-31 2021-08-01 2021-08-02
3 2021-08-01 2021-08-02 2021-08-03
4 2021-08-02 2021-08-03 2021-08-04

intermediate:

df.ge('2021-08-01').any(1)

0    False
1     True
2     True
3     True
4     True
dtype: bool

using only the date columns

filtering by name (Date in the column name):

df[df.filter(like='Date').ge('2021-08-01').any(1)]

filtering by type:

df[df.select_dtypes('datetime64[ns]').ge('2021-08-01').any(1)]

CodePudding user response:

You may use any inside apply:

df[df.apply(lambda x: any([x[col] >= pd.to_datetime('2021-08-01') for col in df.columns]), axis=1)]
  • Related