I am new to programming. Just started a few months ago and I hope I can get some help.
I have a flight delays dataset with columns 'Year', 'Month', 'DayOfMonth', 'DayOfWeek' and 'CRSDepTime' with int64 Dtype.
I need to perform analysis and visualisations to identify the month, day and time with the lowest delays.
Would you advise converting all of these columns to datetime? Can I use pandas' to_datetime() function? If yes, what should the format be?
Thanks in advance! :)
I tried:
df['CRSDepTime'] = pd.to_datetime(df['CRSDepTime'], format='HHMM')
But I am not too sure of the format and it always gives: ValueError: time data '1605' does not match format 'HHMM' (match)
CodePudding user response:
Use to_datetime with format='%H%M' to match HHMM, and errors='coerce' to get NaT for times that cannot be parsed; finally, use Series.dt.time:
df['CRSDepTime'] = pd.to_datetime(df['CRSDepTime'], format='%H%M', errors='coerce').dt.time
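One thing to watch for: because the column is int64, times before 10:00 have lost their leading zeros (00:05 is stored as 5), and such values may not match '%H%M' as they are. A minimal sketch of one way to handle this, assuming the integers really are HHMM values; the sample Series below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample: 5 means 00:05, 930 means 09:30, 2400 is not a valid time.
s = pd.Series([5, 930, 1605, 2400])

# Left-pad to four digits so every value has the HHMM shape before parsing.
padded = s.astype(str).str.zfill(4)

# Values that still cannot be parsed (here 2400) become NaT instead of raising.
parsed = pd.to_datetime(padded, format='%H%M', errors='coerce')
times = parsed.dt.time
print(times)
```

Padding first keeps the parsing vectorized and leaves errors='coerce' to deal only with values that are genuinely invalid.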
For a vectorized solution that builds full datetimes, pass the date columns to to_datetime; you only need to rename DayOfMonth to Day and add Hour and Minute columns:
cols = ['Year', 'Month', 'DayOfMonth']
df['date'] = (pd.to_datetime(df[cols].rename(columns={'DayOfMonth':'Day'})
.assign(Hour=df['CRSDepTime'] // 100, Minute=df['CRSDepTime'] % 100)))
print (df)
Year Month DayOfMonth DayOfWeek CRSDepTime date
0 2005 1 28 5 1605 2005-01-28 16:05:00
1 2005 1 29 6 1605 2005-01-29 16:05:00
2 2005 1 30 7 1610 2005-01-30 16:10:00
3 2005 1 31 1 1605 2005-01-31 16:05:00
4 2005 1 2 7 1900 2005-01-02 19:00:00
5 2005 1 3 1 1900 2005-01-03 19:00:00
Performance:
#6k rows
df = pd.concat([df] * 1000, ignore_index=True)
#Tim Roberts solution
In [51]: %timeit df.apply(translate,axis=1)
173 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [52]: %timeit (pd.to_datetime(df[['Year', 'Month', 'DayOfMonth']].rename(columns={'DayOfMonth':'Day'}).assign(Hour=df['CRSDepTime'] // 100, Minute=df['CRSDepTime'] % 100)))
6.23 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CodePudding user response:
Assuming those are all integers, you can get a single timestamp for the row this way:
import pandas as pd
import datetime
data = [
[2005,1,28,5,1605],
[2005,1,29,6,1605],
[2005,1,30,7,1610],
[2005,1,31,1,1605],
[2005,1,2,7,1900],
[2005,1,3,1,1900],
]
def translate(row):
    # Split the HHMM integer into hour (integer division) and minute (remainder).
    return datetime.datetime(row['Year'], row['Month'], row['DayOfMonth'], row['CRSDepTime'] // 100, row['CRSDepTime'] % 100)
df = pd.DataFrame(data, columns=['Year','Month','DayOfMonth','DayOfWeek','CRSDepTime'])
df['timestamp'] = df.apply(translate,axis=1)
print(df)
Output:
Year Month DayOfMonth DayOfWeek CRSDepTime timestamp
0 2005 1 28 5 1605 2005-01-28 16:05:00
1 2005 1 29 6 1605 2005-01-29 16:05:00
2 2005 1 30 7 1610 2005-01-30 16:10:00
3 2005 1 31 1 1605 2005-01-31 16:05:00
4 2005 1 2 7 1900 2005-01-02 19:00:00
5 2005 1 3 1 1900 2005-01-03 19:00:00
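On the original question of whether everything has to be converted to datetime: for finding the month, day and time with the lowest delays, the integer columns can probably be grouped as they are. A rough sketch, assuming the dataset has a delay column in minutes; the name 'ArrDelay' and the values below are made up for illustration, since the question does not show that column:

```python
import pandas as pd

# Hypothetical data: 'ArrDelay' (minutes) is an assumed column name, not taken from the question.
df = pd.DataFrame({
    'Year': [2005, 2005, 2005, 2005],
    'Month': [1, 1, 2, 2],
    'DayOfMonth': [28, 29, 3, 4],
    'DayOfWeek': [5, 6, 4, 5],
    'CRSDepTime': [1605, 900, 1610, 905],
    'ArrDelay': [30, 5, 12, 2],
})

# Derive the scheduled departure hour from the HHMM integer.
df['Hour'] = df['CRSDepTime'] // 100

# Mean delay per month, per day of week and per departure hour.
by_month = df.groupby('Month')['ArrDelay'].mean()
by_weekday = df.groupby('DayOfWeek')['ArrDelay'].mean()
by_hour = df.groupby('Hour')['ArrDelay'].mean()

# idxmin returns the group label with the lowest mean delay.
print(by_month.idxmin(), by_weekday.idxmin(), by_hour.idxmin())
```

The same grouped Series can be passed to .plot(kind='bar') for the visualisations. A combined datetime column is still handy if you later want to resample or plot along a time axis.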