I'm trying to create a Dask dataframe from a NumPy array, and for that I need to specify the column types. As suggested in the Dask documentation, I pass an empty pandas dataframe as meta. This doesn't raise an error, but every column ends up with dtype object. I need to use the empty pandas dataframe; how can I make this work?
from datetime import datetime
import numpy as np
import pandas as pd
import dask.dataframe as dd
array = np.array([(1.5, 2, 3, datetime(2000, 1, 1)), (4, 5, 6, datetime(2001, 2, 2))])
meta = pd.DataFrame({'col1': pd.Series(dtype='float64'),
'col2': pd.Series(dtype='float64'),
'col3': pd.Series(dtype='float64'),
'date1': pd.Series(dtype='datetime64[ns]')})
print(meta.dtypes)
>>> col1 float64
>>> col2 float64
>>> col3 float64
>>> date1 datetime64[ns]
>>> dtype: object
columns = ['col1', 'col2', 'col3', 'date1']
ddf = dd.from_array(array, columns=columns, meta=meta)
ddf.compute()
print(ddf.dtypes)
>>> col1 object
>>> col2 object
>>> col3 object
>>> date1 object
>>> dtype: object
CodePudding user response:
Does this work? Build a pandas dataframe first, cast the columns with astype, then convert it to Dask:
df = (pd.DataFrame(array, columns=["col1", "col2", "col3", "date1"])
.astype({"col1": "float64",
"col2": "float64",
"col3": "float64",
"date1": "datetime64[ns]"}))
ddf = dd.from_pandas(df, npartitions=10)
The output of ddf.dtypes gives me the correct data types.
CodePudding user response:
Could the dtypes be set after the dataframe is created, using astype?
import pandas as pd
import numpy as np
from datetime import datetime
import dask.dataframe as dd
array = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
columns = ['col1', 'col2', 'col3', 'date1']
ddf = dd.from_array(array, columns=columns)
ddf.compute()
ddf = ddf.astype({'col1': 'float64', 'col2': 'float64', 'col3': 'float64', 'date1': 'datetime64[ns]'})
print(ddf.dtypes)
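Since the question asks to keep the empty pandas dataframe, its dtypes can drive the cast instead of a hand-written dictionary. This is a minimal pandas-only sketch; cast_to_meta is a hypothetical helper name introduced here, not part of pandas or Dask.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# The empty "meta" dataframe from the question, used only as a dtype template.
meta = pd.DataFrame({"col1": pd.Series(dtype="float64"),
                     "col2": pd.Series(dtype="float64"),
                     "col3": pd.Series(dtype="float64"),
                     "date1": pd.Series(dtype="datetime64[ns]")})

array = np.array([(1.5, 2, 3, datetime(2000, 1, 1)),
                  (4, 5, 6, datetime(2001, 2, 2))])


def cast_to_meta(frame, template):
    """Hypothetical helper: cast frame's columns to the dtypes of template."""
    return frame.astype(template.dtypes.to_dict())


# Column names come from meta as well, so they are defined in one place only.
df = pd.DataFrame(array, columns=list(meta.columns))
typed = cast_to_meta(df, meta)
print(typed.dtypes)
```

The same dictionary, meta.dtypes.to_dict(), could presumably be passed to ddf.astype in the Dask version above, but that is an assumption and is not verified here.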