import pandas as pd
import dask.dataframe as dd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=1)
I want to convert the calories column to string.
Here is my attempt:
df = df.astype({"calories": "string"})
df
Dask DataFrame Structure:
              calories duration
npartitions=1
0               string    int64
2                  ...      ...
Dask Name: astype, 3 tasks
df.set_index("calories")
TypeError: Cannot interpret 'string[python]' as a data type
Is there a way to pass in the data types for all the columns and convert them to the desired types? For example, I want to convert many columns to strings, some to dates, and a few to bools.
I know the column names and their data types, and I want Dask to honor them.
CodePudding user response:
Try using a lambda function on the pandas DataFrame:
import pandas as pd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
df['calories'] = df['calories'].apply(lambda x: str(x))
print(df)
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)
Output
calories duration
0 420 50
1 hi 40
2 390 45
Column calories is dtype: object
Column duration is dtype: int64
Edit: If you want to convert all the columns of the DataFrame to strings, you can use applymap (deprecated since pandas 2.1 in favor of DataFrame.map).
import pandas as pd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
df = df.applymap(str)
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)
Output
Column calories is dtype: object
Column duration is dtype: object
Or, using a lambda with applymap:
import pandas as pd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)
# Convert every element to a string
df = df.applymap(lambda x: str(x))
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)
Output
Column calories is dtype: object
Column duration is dtype: object
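As an alternative to element-wise lambdas, pandas' astype also accepts a column-to-dtype mapping, so each column can be given its own target type in one call. The following is only a minimal sketch on the same sample data; casting duration to float64 is an arbitrary choice made here for illustration, not something the question asked for.

```python
import pandas as pd

data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# astype accepts a {column: dtype} mapping; columns not listed keep their dtype.
# The float64 target for "duration" is an illustrative assumption.
converted = df.astype({"calories": str, "duration": "float64"})

for column in converted.columns:
    print("Column ", column, "is dtype:", converted[column].dtype.name)
```

The reported dtype name for a string column can differ between pandas versions, so it is safer to check the values than to rely on the name.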
CodePudding user response:
It seems the error happens when you call set_index, and Dask is not able to recognize "string" as a valid data type when setting the new partition divisions. Instead, you can use str, e.g. ddf = ddf.astype({"calories": str}). Here's a complete reproducible snippet:
import pandas as pd
import dask.dataframe as dd
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45], "other_col": range(3)}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=2)
ddf = ddf.astype({"calories": str}).set_index('calories')
ddf.compute()