Home > database >  Identify invalid dates in pandas dataframe columns
Identify invalid dates in pandas dataframe columns

Time:05-23

Suppose we had the following dataframe-

How can I create the fourth column 'Invalid dates' as specified below using the first three columns in the dataframe?

  Name       Date1       Date2  Invalid dates
0    A  01-02-2022  03-04-2000           None
1    B          23  12-12-2012          Date1
2    C  18-04-1993         abc          Date2
3    D          45         qcf   Date1, Date2

CodePudding user response:

You can select the Dates column with filter (or any other method, including a manual list), compute a Series of invalid dates by converting to_datetime and sub-selecting the NaN values (i.e. invalid dates) with isna,then stack and join to the original DataFrame:

s = (df
     .filter(like='Date') # keep only "Date" columns
      # convert to datetime, NaT will be invalid dates
     .apply(lambda s: pd.to_datetime(s, format='%d-%m-%Y', errors='coerce'))
     .isna()
     # reshape to long format (Series)
     .stack()
    )

out = (df
       .join(s[s].reset_index(level=1)    # keep only invalid dates
            .groupby(level=0)['level_1']  # for all initial indices
            .agg(','.join)                # join the column names
            .rename('Invalid Dates')
           )
       )

alternative with melt to reshape the DataFrame:

cols = df.filter(like='Date').columns

out = df.merge(
    df.melt(id_vars='Name', value_vars=cols, var_name='Invalid Dates')
      .assign(value=lambda d: pd.to_datetime(d['value'], format='%d-%m-%Y',
                                             errors='coerce'))
      .loc[lambda d: d['value'].isna()]
      .groupby('Name')['Invalid Dates'].agg(','.join),
    left_on='Name', right_index=True, how='left'
)

output:

  Name       Date1       Date2 Invalid Dates
0    A  01-02-2022  03-04-2000           NaN
1    B          23  12-12-2012         Date1
2    C  18-04-1993         abc         Date2
3    D          45         qcf   Date1,Date2

CodePudding user response:

Use DataFrame.filter for filter columns with substring Date, then convert to datetimes by to_datetime all columns of df1 with errors='coerce' for missing values if no match, so possible test them by DataFrame.isna and by DataFrame.dot extract columnsnames separated by ,:

df1 = df.filter(like='Date')
df['Invalid dates']=((df1.apply(lambda x:pd.to_datetime(x,format='%d-%m-%Y',errors='coerce'))
                        .isna() & df1.notna())
                        .dot(df1.columns   ',')
                        .str[:-1]
                        .replace('', np.nan))

print (df)
  Name       Date1       Date2 Invalid dates
0    A  01-02-2022  03-04-2000           NaN
1    B          23  12-12-2012         Date1
2    C  18-04-1993         abc         Date2
3    D          45         qcf   Date1,Date2
  • Related