PySpark pandas TypeError when trying to concatenate two DataFrames

I get the error below when I try to concatenate two pandas DataFrames:


TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid

At first I thought the error occurred because one of the DataFrames has a column that contains lists. So I tried concatenating two DataFrames that have no list columns, but I got the same error. To be sure, I printed the types of the DataFrames: both are pandas.core.frame.DataFrame. Why do I get this error even though neither object is a list?

import pyspark.pandas as ps
split_col = split_col.toPandas()
split_col2 = split_col2.toPandas()
dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_1455538/463168233.py in <module>
      2 split_col = split_col.toPandas()
      3 split_col2 = split_col2.toPandas()
----> 4 dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)

/home/anaconda3/envs/virtenv/lib/python3.10/site-packages/pyspark/pandas/namespace.py in concat(objs, axis, join, ignore_index, sort)
   2464     for obj in objs:
   2465         if not isinstance(obj, (Series, DataFrame)):
-> 2466             raise TypeError(
   2467                 "cannot concatenate object of type "
   2468                 "'{name}"

TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid

type(split_col)
pandas.core.frame.DataFrame
type(split_col2)
pandas.core.frame.DataFrame

I want to concatenate the two DataFrames, but I'm stuck. Do you have any suggestions?

CodePudding user response：

You get this error because ps.concat belongs to the pandas API on Spark (pyspark.pandas) and accepts only pyspark.pandas Series and DataFrame objects, while toPandas() returns plain pandas DataFrames.

Instead of converting your PySpark DataFrames to pandas DataFrames with toPandas(), convert them to pandas-on-Spark DataFrames and pass those to ps.concat:

split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()
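
Alternatively, if the data is small enough that collecting it with toPandas() is acceptable, you can keep the plain pandas DataFrames and concatenate them with pandas itself rather than with pyspark.pandas. A minimal sketch, assuming two small frames built by hand as stand-ins for the results of toPandas() (the column names a and b and the values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for split_col.toPandas() and split_col2.toPandas();
# the column names and values are invented for this example.
split_col = pd.DataFrame({"a": [1, 2, 3]})
split_col2 = pd.DataFrame({"b": ["x", "y", "z"]})

# pandas.concat accepts plain pandas DataFrames, unlike ps.concat,
# which only accepts pyspark.pandas Series/DataFrame objects.
dfNew = pd.concat([split_col, split_col2], axis=1)

print(type(dfNew))
print(dfNew.shape)
```

Keep in mind that toPandas() pulls all rows onto the driver, so this only suits data that fits in memory there. Also, as far as I know, newer PySpark releases deprecate to_pandas_on_spark() in favor of pandas_api(), so check which one your version provides.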