Home > database >  Pandas / Wrong date format when reading Excel file
Pandas / Wrong date format when reading Excel file

Time:10-18

I am reading an excel files :

dfEx=pd.read_excel(fileExcel, engine='openpyxl')

There is a date column in Object format, and dfEx["Date"] gives me these results :

id date
0 15/10/2022 10:44:59
1 15/10/2022 10:47:53
2 15/10/2022 10:49:50
3 15/10/2022 10:50:40
4 15/10/2022 10:54:05
...
174 2022-10-16 00:00:00
175 2022-10-16 00:00:00
176 2022-10-16 00:00:00
177 2022-10-16 00:00:00
178 2022-10-16 00:00:00

I don't get why I have the format change between the beginning and the end, as in the Excel file the separator character is always / for year/month/day

The syntax I have in the Excel file is :

15/10/2022 11:13:64

for the first and

15-10-2022

for the rest

So when I use the datetime function it doesn't work :

dfEx['Date'] =  pd.to_datetime(dfEx['Date'], format='%d/%m/%Y %H:%M:%S', exact=False)

ValueError: time data '2022-10-15 00:00:00' does not match format '%d/%m/%Y %H:%M:%S' (search)

CodePudding user response:

Add errors='coerce' for missing values if no match format 15/10/2022 10:44:59 and then replace missing values by another to_datetime for convert format 2022-10-16 00:00:00:

dfEx['date'] = (pd.to_datetime(dfEx['date'], format='%d/%m/%Y %H:%M:%S', errors='coerce')
                  .fillna(pd.to_datetime(dfEx['date'], errors='coerce')))
print (dfEx)
   id                 date
0    0 2022-10-15 10:44:59
1    1 2022-10-15 10:47:53
2    2 2022-10-15 10:49:50
3    3 2022-10-15 10:50:40
4    4 2022-10-15 10:54:05
5  174 2022-10-16 00:00:00
6  175 2022-10-16 00:00:00
7  176 2022-10-16 00:00:00
8  177 2022-10-16 00:00:00
9  178 2022-10-16 00:00:00

CodePudding user response:

I see that you have two different dates formats in excel:

  • Ones with dashes 2022-10-16 00:00:00
  • Others with slashes 15/10/2022 10:54:05

So you cannot parse all the dates at once, you first have to split the dataframes:

df_slashes = df[df.Date.str.contains("/", regex=False))
df_dashes = df[df.Date.str.contains("-", regex=False))

Then for each dataframe run the corresponding to_datetime function:

df_slashes_dates = pd.to_datetime(df_slashes.Date, format='%d/%m/%Y %H:%M:%S')
df_dashes_dates = pd.to_datetime(df_dashes.Date, format='%d-%m-%Y %H:%M:%S')

Finally do whatever you want with these two series.

  • Related