convert dask dataframe column to string


import pandas as pd
import dask.dataframe as dd
                                                                    
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}     
                                                                    
df = pd.DataFrame(data)                                             
                                                                    
df = dd.from_pandas(df, npartitions=1)

I want to convert the calories column to string.

So here is my attempt:

df = df.astype({"calories": "string"})

df
Dask DataFrame Structure:
              calories duration
npartitions=1
0               string    int64
2                  ...      ...
Dask Name: astype, 3 tasks

df.set_index("calories")
TypeError: Cannot interpret 'string[python]' as a data type

Is there a way I can pass in the data types for all the columns and convert them to the desired types? Say I want to convert many columns to strings, some to dates, and a few to booleans.

I know the column names and the data types, and I want Dask to honor them.


CodePudding user response:

Try with a lambda function.

import pandas as pd                                                 
                                                                    
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}                                                                  
df = pd.DataFrame(data)                                             
df['calories'] = df['calories'].apply(lambda x: str(x))                                                               
print(df)
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)

Output

  calories  duration
0      420        50
1       hi        40
2      390        45
Column  calories is dtype: object
Column  duration is dtype: int64
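
For the Dask DataFrame from the question, the same per-column conversion can be written with Series.apply; the snippet below is a minimal sketch, and the meta argument tells Dask the expected output dtype so it does not have to guess.

import pandas as pd
import dask.dataframe as dd

data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=1)

# Convert a single column element-wise; meta declares the resulting dtype
ddf["calories"] = ddf["calories"].apply(str, meta=("calories", "object"))
print(ddf.compute().dtypes)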

Edit: If you want to convert all the columns of the DataFrame to strings, you can use applymap.

import pandas as pd                                                 
                                                                    
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}                                                                  
df = pd.DataFrame(data)                                             
df = df.applymap(str)
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)

Output

Column  calories is dtype: object
Column  duration is dtype: object

Or using a lambda with applymap:

import pandas as pd                                                 
                                                                
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}                                                                  
df = pd.DataFrame(data)                                             
df = df.applymap(lambda x: str(x))
for column in df.columns:
    print("Column ", column, "is dtype:", df[column].dtype.name)

Output

Column  calories is dtype: object
Column  duration is dtype: object
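
On the Dask side, one way to mirror the applymap approach is map_partitions, which runs a pandas function on each partition. This is only a sketch of that idea, not the only option.

import pandas as pd
import dask.dataframe as dd

data = {"calories": [420, "hi", 390], "duration": [50, 40, 45]}
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=1)

# Apply the pandas element-wise conversion to every partition
ddf = ddf.map_partitions(lambda pdf: pdf.applymap(str))
print(ddf.compute().dtypes)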

CodePudding user response:

It seems the error happens when you call set_index, and Dask does not recognize "string" as a valid data type when setting the new partition divisions. Instead, you can use str, e.g. ddf = ddf.astype({"calories": str}). Here's a complete reproducible snippet:

import pandas as pd
import dask.dataframe as dd
                                                                    
data = {"calories": [420, "hi", 390], "duration": [50, 40, 45], "other_col": range(3)}                               
df = pd.DataFrame(data)                                          
ddf = dd.from_pandas(df, npartitions=2)

ddf = ddf.astype({"calories": str}).set_index('calories')
ddf.compute()
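
To address the broader question of converting many columns at once, astype also accepts a dict mapping column names to dtypes on a Dask DataFrame. The snippet below is a sketch with made-up column names; it uses dd.to_datetime for the date-like column, since parsing strings to dates is usually done with to_datetime rather than astype.

import pandas as pd
import dask.dataframe as dd

data = {
    "name": [1, 2, 3],
    "active": [0, 1, 1],
    "joined": ["2021-01-01", "2021-02-01", "2021-03-01"],
}
ddf = dd.from_pandas(pd.DataFrame(data), npartitions=1)

# One astype call with a column -> dtype mapping
ddf = ddf.astype({"name": str, "active": bool})
# Date-like strings are typically parsed with to_datetime
ddf["joined"] = dd.to_datetime(ddf["joined"])
print(ddf.compute().dtypes)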