Is there a function to convert time HHMM (int64 in a df column) to a datetime object?

I am new to programming. Just started a few months ago and I hope I can get some help.

I have a flight delays dataset with columns 'Year', 'Month', 'DayOfMonth', 'DayOfWeek' and 'CRSDepTime' with int64 Dtype.

I need to perform analysis and visualisations to identify the month, day and time with the lowest delays.

Would you advise converting all dtypes to datetime? Can I use pandas' to_datetime() function? If yes, what should the format be?

Thanks in advance! :)

I tried:

df['CRSDepTime'] = pd.to_datetime(df['CRSDepTime'], format='HHMM')

But I am not sure of the format, and it always raises: ValueError: time data '1605' does not match format 'HHMM' (match)

CodePudding user response：

Use to_datetime with format='%H%M' to match HHMM, pass errors='coerce' so that unparseable times become NaT, and finally use Series.dt.time to keep only the time of day:

df['CRSDepTime'] = pd.to_datetime(df['CRSDepTime'], format='%H%M', errors='coerce').dt.time
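
The format argument takes strftime-style directives (%H for the hour, %M for the minute), which is why the literal 'HHMM' is rejected. Below is a minimal sketch of the same idea on made-up values. The zero-padding step is an extra precaution added here, not part of the answer above: it assumes early-morning times may be stored as short integers such as 5 (meaning 00:05), which would otherwise not have the HHMM shape.

```python
import datetime
import pandas as pd

# Hypothetical sample: 16:05, 06:05, 00:05 and one invalid value (2500)
s = pd.Series([1605, 605, 5, 2500], name="CRSDepTime")

# Zero-pad to four characters so every value has the shape HHMM
padded = s.astype(str).str.zfill(4)

# %H and %M are strftime directives; unparseable values become NaT
parsed = pd.to_datetime(padded, format="%H%M", errors="coerce")

# Keep only the time-of-day part of each parsed value
times = parsed.dt.time
print(times)
```

Rows that come back as NaT are worth inspecting before any analysis, since they point to values that are not valid clock times.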

For a vectorized solution that builds full datetimes, pass a DataFrame of date parts to to_datetime. You only need to rename DayOfMonth to Day and add Hour and Minute columns:

cols = ['Year', 'Month', 'DayOfMonth']
df['date'] = (pd.to_datetime(df[cols].rename(columns={'DayOfMonth':'Day'})
                  .assign(Hour=df['CRSDepTime'] // 100, Minute=df['CRSDepTime'] % 100)))
 
print (df)
   Year  Month  DayOfMonth  DayOfWeek  CRSDepTime                date
0  2005      1          28          5        1605 2005-01-28 16:05:00
1  2005      1          29          6        1605 2005-01-29 16:05:00
2  2005      1          30          7        1610 2005-01-30 16:10:00
3  2005      1          31          1        1605 2005-01-31 16:05:00
4  2005      1           2          7        1900 2005-01-02 19:00:00
5  2005      1           3          1        1900 2005-01-03 19:00:00
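
Once a combined datetime column exists, its parts can be used as grouping keys for the kind of analysis described in the question. The sketch below is only an illustration under assumptions: the column name DepDelay (delay in minutes) and all values are made up, since the delay column is not shown in the question.

```python
import pandas as pd

# Made-up sample; "DepDelay" (minutes) is a hypothetical column name
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2005-01-28 16:05", "2005-01-29 16:05",
        "2005-02-02 19:00", "2005-02-03 19:00",
    ]),
    "DepDelay": [10, 30, 4, 6],
})

# Average delay per calendar month and per scheduled departure hour
by_month = df.groupby(df["date"].dt.month)["DepDelay"].mean()
by_hour = df.groupby(df["date"].dt.hour)["DepDelay"].mean()

# idxmin returns the group label with the smallest mean delay
best_month = by_month.idxmin()
best_hour = by_hour.idxmin()
print(best_month, best_hour)
```

The existing integer columns such as Month and DayOfWeek can also be passed to groupby directly, so converting every column to datetime is not required for this kind of summary.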

Performance:

#6k rows
df = pd.concat([df] * 1000, ignore_index=True)


#Tim Roberts solution
In [51]: %timeit df.apply(translate,axis=1)
173 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [52]: %timeit (pd.to_datetime(df[['Year', 'Month', 'DayOfMonth']].rename(columns={'DayOfMonth':'Day'}).assign(Hour=df['CRSDepTime'] // 100, Minute=df['CRSDepTime'] % 100)))
6.23 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
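
The %timeit lines above only run in IPython. As a rough sketch, the comparison can be reproduced in a plain script with the standard timeit module. The helper names row_wise and vectorized are made up for this example, the sample has 2,000 rows rather than 6k, and the timings will differ from machine to machine.

```python
import datetime
import timeit
import pandas as pd

base = pd.DataFrame(
    [[2005, 1, 28, 1605], [2005, 1, 2, 1900]],
    columns=["Year", "Month", "DayOfMonth", "CRSDepTime"],
)
df = pd.concat([base] * 1000, ignore_index=True)  # 2,000 rows

def row_wise(frame):
    # Builds one Python datetime per row (slow path)
    return frame.apply(
        lambda r: datetime.datetime(
            r["Year"], r["Month"], r["DayOfMonth"],
            r["CRSDepTime"] // 100, r["CRSDepTime"] % 100,
        ),
        axis=1,
    )

def vectorized(frame):
    # Lets to_datetime assemble datetimes from whole columns at once
    parts = frame[["Year", "Month", "DayOfMonth"]].rename(
        columns={"DayOfMonth": "Day"}
    ).assign(Hour=frame["CRSDepTime"] // 100, Minute=frame["CRSDepTime"] % 100)
    return pd.to_datetime(parts)

# Both approaches should produce the same timestamps
same = bool((row_wise(df) == vectorized(df)).all())

t_row = timeit.timeit(lambda: row_wise(df), number=3)
t_vec = timeit.timeit(lambda: vectorized(df), number=3)
print(same, t_row, t_vec)
```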

CodePudding user response：

Assuming those are all integers, you can get a single timestamp for the row this way:

import pandas as pd
import datetime

data = [
    [2005,1,28,5,1605],
    [2005,1,29,6,1605],
    [2005,1,30,7,1610],
    [2005,1,31,1,1605],
    [2005,1,2,7,1900],
    [2005,1,3,1,1900],
]

def translate(row):
    # Hour is the integer part of HHMM / 100, minute is the remainder
    return datetime.datetime(row['Year'], row['Month'], row['DayOfMonth'], row['CRSDepTime'] // 100, row['CRSDepTime'] % 100)

df = pd.DataFrame(data, columns=['Year','Month','DayOfMonth','DayOfWeek','CRSDepTime'])

df['timestamp'] = df.apply(translate,axis=1)
print(df)

Output:

   Year  Month  DayOfMonth  DayOfWeek  CRSDepTime           timestamp
0  2005      1          28          5        1605 2005-01-28 16:05:00
1  2005      1          29          6        1605 2005-01-29 16:05:00
2  2005      1          30          7        1610 2005-01-30 16:10:00
3  2005      1          31          1        1605 2005-01-31 16:05:00
4  2005      1           2          7        1900 2005-01-02 19:00:00
5  2005      1           3          1        1900 2005-01-03 19:00:00