How to convert column values present in date/custom/general formats in a PySpark/Pandas dataframe into DD/MM/YYYY format


I have a dataframe with a column of date values that appear in multiple formats (shown as Custom/General/Date in Excel), like those in the "Before" column below:

[Screenshot: sample values in the "Before" column and their desired "After" (DD/MM/YYYY) equivalents]

They are all originally date values, but somewhere along the way they were saved in different formats in the input CSV files I received.

My objective is to convert the values into the "DD/MM/YYYY" format shown in the "After" column.

I'm reading this as a Pandas/Spark dataframe, since the column contains thousands of such values that need to be converted.

I tried the following, but it doesn't produce the "DD/MM/YYYY" format I need, and some of the values remain unchanged:

df['After'] = pd.to_datetime(df['Before'], format='%d-%m-%y', errors='coerce')

Can anyone please help with how to go about this?

Cheers!

CodePudding user response:

Here is my attempt:

import pandas as pd

df = pd.read_excel('test.xls')

# Treat the value as an Excel serial number (days since 1899-12-30) where possible;
# where that fails, fall back to parsing the value as a date string.
df['ADATE'] = pd.to_datetime(
    pd.to_numeric(df['A'], errors='coerce'), unit='D', origin='1899-12-30'
).fillna(pd.to_datetime(df['A'], errors='coerce'))

print(df)

Output (how should the last line be interpreted?):

             A                   ADATE
0    43746.39028 2019-10-08 09:22:00.192
1          43735 2019-09-27 00:00:00.000
2  1/1/2021 0:00 2021-01-01 00:00:00.000
3        50:11.0                     NaT

EDIT

Using ".dt.strftime('%d/%m/%Y')" you use the desired format, remembering that this will change the type.

df['ADATE'] = pd.to_datetime(
    pd.to_numeric(df['A'], errors='coerce'), unit='D', origin='1899-12-30'
).fillna(
    pd.to_datetime(df['A'], errors='coerce')
).dt.strftime('%d/%m/%Y')
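
Since the question also mentions PySpark, here is a minimal sketch of the same idea in Spark. It assumes the mixed values are read into a string column called Before from a hypothetical test.csv, that you are on Spark 3.x, and that the only string pattern present looks like '1/1/2021 0:00' (add more to_date branches for other patterns you encounter). Numeric values are treated as Excel serial numbers counted from 1899-12-30, and the time-of-day fraction is dropped.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Read everything as strings; 'Before' holds the mixed-format values.
sdf = spark.read.csv('test.csv', header=True)

sdf = sdf.withColumn(
    'After',
    F.date_format(
        F.coalesce(
            # Excel serial numbers: whole days since 1899-12-30
            # (non-numeric strings cast to NULL, so this branch yields NULL for them)
            F.expr("date_add(to_date('1899-12-30'), cast(cast(Before as double) as int))"),
            # Date strings such as '1/1/2021 0:00'
            F.to_date('Before', 'd/M/yyyy H:mm'),
        ),
        'dd/MM/yyyy',
    ),
)

sdf.show(truncate=False)

As in the pandas version, anything that matches neither branch (like 50:11.0) ends up as NULL.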