Update:
I was able to perform the conversion. The next step is to put the result back into the ddf.
Following the book's suggestion, I did the following:
- parsed the dates and stored them in a separate variable.
- dropped the original date column using
ddf2=ddf.drop('date',axis=1)
- appended the newly parsed dates using assign
ddf3=ddf2.assign(date=parsed_date)
The new date was added as a new column, in the last position.
Question 1: Is there a more efficient way to insert parsed_date back into the ddf?
Question 2: What if I have three columns of string dates (date, startdate, enddate)? I could not find out whether a loop would work, so that I don't have to repeat the code for each string date column (or my approach may be wrong).
Question 3: For a date in the 11OCT2020:13:03:12.452 format, is "%d%b%Y:%H:%M:%S" the right format string? I feel I am missing something for the seconds, because the seconds value above is a decimal number/float.
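On Question 3, a quick way to check a format string is the standard library's datetime.strptime, which uses the same directives. The sketch below assumes the sample value from the question and an English locale for the month abbreviation; it suggests that the fractional seconds need an extra ".%f".

```python
from datetime import datetime

raw = "11OCT2020:13:03:12.452"

# Without a directive for the fractional part, parsing fails
# because ".452" is left over after %S.
try:
    datetime.strptime(raw, "%d%b%Y:%H:%M:%S")
    failed_without_fraction = False
except ValueError:
    failed_without_fraction = True

# %f consumes the fractional seconds (1 to 6 digits).
parsed = datetime.strptime(raw, "%d%b%Y:%H:%M:%S.%f")
print(parsed)  # 2020-10-11 13:03:12.452000
```

The same format string can be passed as the format argument of to_datetime.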
Older:
I have the following column in a dask dataframe:
ddf = dd.from_pandas(pd.DataFrame({'date': ['15JAN1955', '25DEC1990', '06MAY1962', '20SEPT1975']}), npartitions=1)
When it was initially loaded as a dask dataframe, the column came in as an object/string. While looking for guidance in the Data Science with Python and Dask book, I found the suggestion to read the column in with the np.str datatype at initial load. However, I could not understand how to convert the column into a date datatype. I tried processing it using dd.to_datetime, and the output showed dtype: datetime64[ns], but when I ran ddf.dtypes, the frame still showed an object dtype.
I would like to change the object dtype to a date dtype so that I can filter on it or run a condition later on.
CodePudding user response:
dask.dataframe supports the pandas API for handling datetimes, so this should work:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({"date": ["15JAN1955", "25DEC1990", "06MAY1962", "20SEPT1975"]})
print(pd.to_datetime(df["date"]))
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
# Name: date, dtype: datetime64[ns]
ddf = dd.from_pandas(df, npartitions=2)
ddf["date"] = dd.to_datetime(ddf["date"])
print(ddf.compute())
# date
# 0 1955-01-15
# 1 1990-12-25
# 2 1962-05-06
# 3 1975-09-20
CodePudding user response:
Usually when I am having a hard time computing or parsing, I use an apply call with a lambda. Some say it is not the best way, but it works. Give it a try.