assign a converted datetime str back to the dask df

Time:02-26

I converted a string date column of my dask dataframe to datetime with pandas, which produced a DatetimeIndex. When I try assigning it back to the source dask dataframe using the line

ddf.assign(date=date_parsed)

I get a

ValueError: Length of values (1000000) does not match length of index (2).

My first guess was that the created DatetimeIndex has the correct length but the source has only 2 index entries. I tried converting the DatetimeIndex into a pd.DataFrame, which worked, but I cannot add that pandas DataFrame to the dask dataframe. I also tried converting it to a Series, but I still could not append or assign it.

What I would like to do is to assign the datetimeindex back to the source dask df.

Sample dask dataframe, converted from pandas. All values are strings.

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'fname': ['dwayne', 'peter', 'dead', 'wonder'],
                   'lname': ['rock', 'pan', 'pool', 'boy'],
                   'entrydate': ['31DEC2021', '22JAN2022', np.nan, '15DEC2025']})

ddf = dd.from_pandas(df, npartitions=2)

What I did: (1) I parsed the entrydate values and converted them to datetime, which gave me the following:

DatetimeIndex(['2021-12-31', '2022-01-22', 'NaT', '2025-12-15'], dtype='datetime64[ns]', length=4, freq=None)

(2) I dropped the 'entrydate' column using the drop function. (3) When I tried the assign function, I get the ValueError...

CodePudding user response:

Hi, this ValueError occurred because of the NaN value. Pass errors='coerce' so that values that cannot be parsed become NaT instead of raising an error. Pandas' built-in datetime conversion is enough for this issue.

df['entrydate'] = pd.to_datetime(df['entrydate'], errors='coerce').dt.strftime('%Y-%m-%d')
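
Note that the trailing .dt.strftime call turns the parsed values back into strings. If the goal is a real datetime column, a minimal pandas-only sketch could look like the following. The explicit format string '%d%b%Y' is an assumption based on the sample values ('31DEC2021'), not something stated in the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question, reduced to the date column.
df = pd.DataFrame({"entrydate": ["31DEC2021", "22JAN2022", np.nan, "15DEC2025"]})

# Assumed format: day, abbreviated month name, four-digit year.
# errors="coerce" turns anything that cannot be parsed into NaT.
df["entrydate"] = pd.to_datetime(df["entrydate"], format="%d%b%Y", errors="coerce")

print(df.dtypes)
print(df)
```

The missing entry ends up as NaT, and the column keeps a datetime dtype, so date arithmetic and the .dt accessor remain available.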

CodePudding user response:

There is no need to create a new column using assign. Dask dataframes support much of the pandas API, so the following works:

import dask.dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({'fname': ['dwayne', 'peter', 'dead', 'wonder'],
                   'lname': ['rock', 'pan', 'pool', 'boy'],
                   'entrydate': ['31DEC2021', '22JAN2022', np.nan, '15DEC2025']})

ddf = dd.from_pandas(df, npartitions=2)

# parse lazily with dask; roughly the same as ddf.assign(date=...)
ddf["date"] = dd.to_datetime(ddf["entrydate"])

See also this answer.
