How to check the dates in different columns?-CodePudding

I have a dataframe "expeditions" where there are 3 columns ("basecamp_date", "highpoint_date" and "termination_date"). I would like to check that the basecamp date is before the highpoint date and before the termination date because I noticed that there are rows where this is not the case (see picture) Do you have any idea what I should do (a loop, a new dataframe...?)

Code

import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")

CodePudding user response：

I would start by transforming the columns into a datetime format to be extra sure:

for x in df:
  df[x] = pd.to_datetime(df[x],infer_datetime_format=True)

And then follow it by the comparison using np.where()

df['Check'] = np.where((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date']),True,False)

EDIT: Based on OPs follow up question I propose the following solution:

filtered = df[((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date'])) | (df.isna().sum(axis=1) != 0)]

Example:

  basecamp_date highpoint_date termination_date
0    2008-01-04     2008-05-01       2008-04-05
1           NaN     2008-05-03       2008-06-03
2    2008-01-04     2008-01-01       2009-01-01

Only row 0 should be kept as row 2 doesn't match that date conditions and row 1 has a null value. Using the proposed code, the output is:

  basecamp_date highpoint_date termination_date
0    2008-01-04     2008-05-01       2008-04-05
1           NaT     2008-05-03       2008-06-03

CodePudding user response：

You should convert your data to datetime format:

df['date_col'] = pd.to_datetime(df['date_col'])

and then do like this:

df[df['date_col1'] < df['date_col2']]

In your case date_col might be you column names.

CodePudding user response：

All the other answers are working solutions but I find this much easier:

df.query("basecamp_date < highpoint_date and basecamp_date
 < termination_date")

CodePudding user response：

Use Series.lt for compare columns and chain masks by & for bitwise AND:

m = (df['basecamp_date'].ge(df['highpoint_date']) |
             df['basecamp_date'].ge(df['termination_date']) |
             df[['basecamp_date', 'termination_date', 'highpoint_date']].notna().all(1)))

)

If need check matched values:

df1 = df[m]