Converting pandas dataframe of only string types to pyspark dataframe fails

Time: 02-23

Let's say I have a pandas dataframe of the following format, which I already converted to string throughout, since I don't want to define a schema for it when converting to a PySpark dataframe. I converted the dataframe like this:

train_pd = X_train_f.astype('string')
train_pd.info(verbose=True)

 #    Column                               Dtype 
---   ------                               ----- 
 0    col1                                 string
 1    col2                                 string
 2    col3                                 string
 3    col4                                 string

When I now run the following code, I get this error message:

training = spark.createDataFrame(train_pd)

TypeError: field col15: Can not merge type <class 'pyspark.sql.types.StructType'> and <class 'pyspark.sql.types.StringType'>

Why is that? I thought that by converting everything to string I would bypass schema inference.

Sample Data

col0    col1    col2    col3    col4    col5    col6    col7    col8    col9    col10   col11   col12   col13   col14   col15   col16   col17   col18   col19   col20   col21   col22   col23   col24   col25   col26   col27   col28   col29   col30   col31   col32   col33   col34   col35   col36   col37   col38   col39   ...     col355  col356  col357  col358  col359  col360  col361  col362  col363  col364  col365  col366  col367  col368  col369  col370  col371  col372  col373  col374  col375  col376  col377  col378  col379  col380  col381  col382  col383  col384  col385  col386  col387  col388  col389  col390  col391  col392  col393  col394
DUMMY   DUMMY   DUMMY   DUMMY   DUMMY   DUMMY   DUMMY   DUMMY   1144418     0   1908    0   DUMMY       DUMMY   50  <NA>    <NA>    0   0   0001    DUMMY       2021-11-03 16:51:25     2021-11-03 17:23:13     04  <NA>    <NA>    <NA>    DUMMY   DUMMY       DUMMY       DUMMY       DUMMY       7   DUMMY   <NA>    DUMMY   DUMMY       <NA>    30  4315    ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0
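Note the <NA> entries in the sample row above: the pandas string extension dtype represents missing values as the pd.NA scalar, and Spark's row-based type inference can stumble over those objects rather than respecting the column dtype. A common workaround (shown here as a sketch with a made-up one-column frame, not the asker's data) is to fall back to plain object columns with None for the missing values before calling spark.createDataFrame:

```python
import pandas as pd

# A tiny stand-in for one of the problem columns (e.g. col15).
df = pd.DataFrame({"col15": ["DUMMY", None]}).astype("string")

# The 'string' dtype stores missing values as pd.NA, not None/NaN:
print(df.iloc[1, 0])  # <NA>

# Sketch of a workaround: plain object columns with None for missing
# values, which Spark's inference handles as nullable strings.
clean = df.astype(object).where(df.notna(), None)
print(clean.iloc[1, 0])  # None
```

This only addresses the missing-value representation; the answer below sidesteps inference entirely by supplying an explicit schema.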

CodePudding user response:

Run this script, taken from Converting Pandas dataframe into Spark dataframe error, before calling training = pandas_to_spark(train_pd). It builds an explicit schema from the pandas dtypes, so Spark never has to infer the types itself:

from pyspark.sql.types import (
    StructType, StructField, StringType, LongType,
    IntegerType, DoubleType, TimestampType,
)

# Auxiliary functions
def equivalent_type(f):
    """Map a pandas dtype to the matching Spark SQL type."""
    if f == 'datetime64[ns]': return TimestampType()
    elif f == 'int64': return LongType()
    elif f == 'int32': return IntegerType()
    elif f == 'float64': return DoubleType()  # 64-bit float; FloatType() would narrow to 32-bit
    else: return StringType()

def define_structure(string, format_type):
    try:
        typo = equivalent_type(format_type)
    except Exception:
        typo = StringType()
    return StructField(string, typo)

# Given a pandas dataframe, return a Spark dataframe with an explicit schema.
def pandas_to_spark(pandas_df):
    columns = list(pandas_df.columns)
    types = list(pandas_df.dtypes)
    struct_list = []
    for column, typo in zip(columns, types):
        struct_list.append(define_structure(column, typo))
    p_schema = StructType(struct_list)
    return spark.createDataFrame(pandas_df, p_schema)
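As a quick sanity check on the dtype names that equivalent_type matches against, the mapping can be probed with pure pandas, no Spark session needed (the column names here are made up for illustration):

```python
import pandas as pd

# One column per branch of equivalent_type().
probe = pd.DataFrame({
    "ts": pd.to_datetime(["2021-11-03 16:51:25"]),  # -> TimestampType
    "n": [1144418],                                 # -> LongType
    "x": [0.0],                                     # -> DoubleType
    "s": ["DUMMY"],                                 # -> StringType (fallback)
})
print([str(t) for t in probe.dtypes])
```

Note that the asker's columns are all the pandas 'string' extension dtype, which matches none of the explicit branches and therefore falls through to StringType() — exactly the schema they wanted.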