I'm trying to build a computed column in dask, a datetime from separate fields year, month, day, hour. And I can't find a way to make it work.
With the method below it's creating a datetime column, but inside it's not datetime. I've tried different formulas, but none that work.
- Python 3.8.10
- pandas==1.5.3
- dask==2023.1.1
- aiohttp==3.8.3
Get the data
# data processing
import dask.dataframe as dd
# web data source
url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
# read demo data
dtable = dd.read_csv(url)
# print list of columns
print('demo data list of fields : ',dtable.columns)
result:
demo data list of fields : Index(['year', 'month', 'day',
'dep_delay', 'arr_delay', 'carrier', 'origin', 'dest', 'air_time',
'distance', 'hour'], dtype='object')
Then create the field. It looks like working, but no.
# create datetime column from the 'year','month','day','hour' fields
dtable['flight_datetime'] = dd.to_datetime(
(dtable.year *1000000
dtable.month*10000
dtable.day*100
dtable.hour).astype(str), format='%Y%m%d%H', errors='ignore')
print('demo data list of fields : ',dtable.columns)
print('demo data fields types : ',dtable.dtypes)
print(dtable.flight_datetime.head())
print(dtable.flight_datetime.dt.year.head())
result:
demo data list of fields : Index(['year', 'month', 'day',
'dep_delay', 'arr_delay', 'carrier', 'origin', 'dest', 'air_time',
'distance', 'hour', 'flight_datetime'], dtype='object')
demo data fields types :
year int64
month int64
day int64
dep_delay int64
arr_delay int64
carrier object
origin object
dest object
air_time int64
distance int64
hour int64
flight_datetime datetime64[ns]
dtype: object
0 2014010109
1 2014010111
2 2014010119
3 2014010107
4 2014010113
Name: flight_datetime, dtype: object
AttributeError: 'Series' object has no attribute 'year'
CodePudding user response:
As @RomanPerekhrest says in the comments, your not using the correct syntax for dd.to_datetime
. The following is working for me:
dtable_time = dtable[['year','month','day','hour']]
dtable['flight_datetime'] = dd.to_datetime(dtable_time)
print('demo data list of fields : ', dtable.columns)
print('demo data fields types : ', dtable.dtypes)
print(dtable.flight_datetime.head())
print(dtable.flight_datetime.dt.year.head())
outputs:
demo data list of fields : Index(['year', 'month', 'day', 'dep_delay', 'arr_delay', 'carrier', 'origin',
'dest', 'air_time', 'distance', 'hour', 'flight_datetime'],
dtype='object')
demo data fields types : year int64
month int64
day int64
dep_delay int64
arr_delay int64
carrier object
origin object
dest object
air_time int64
distance int64
hour int64
flight_datetime datetime64[ns]
dtype: object
0 2014-01-01 09:00:00
1 2014-01-01 11:00:00
2 2014-01-01 19:00:00
3 2014-01-01 07:00:00
4 2014-01-01 13:00:00
Name: flight_datetime, dtype: datetime64[ns]
0 2014
1 2014
2 2014
3 2014
4 2014
Name: flight_datetime, dtype: int64