Joining two dataframe of one column generated with spark


I'm working with pyspark and pandas in Databricks. I'm generating the following two dataframes:

from datetime import datetime, timedelta
import pandas as pd

start_date = datetime.today() - timedelta(days=60)
end_date = datetime.today()
date_list = pd.date_range(start=start_date, end=end_date).strftime('%Y-%m-%d').tolist()
date_df = spark.createDataFrame(date_list, 'string').toDF("date")

and

import numpy as np

random_list = np.random.normal(loc=50, scale=10, size=61)
random_list = [round(i) for i in random_list]
integer_df = spark.createDataFrame(random_list, 'integer').toDF("value")

so I have two single-column dataframes ("date" and "value") of the same length, and I'd like to "merge" them into one dataframe.

I've tried this:

integer_df = pd.concat(date_df)

which returns the error: first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"

and this

test_df = pd.concat([integer_df, date_df], axis=1, join='inner')

which returns the error: cannot concatenate object of type 'list'; only ps.Series and ps.DataFrame are valid

Mostly I'd like to understand these errors.

CodePudding user response:

From what I can see, you are not converting the objects correctly: for example, you are trying to concatenate a Spark DataFrame with a pandas DataFrame.

first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"

This one was raised because you passed the wrong type of object to concatenate. If you are going to use pandas, you should pass it pandas (or pandas-on-Spark) objects.

So to fix your first error, I would just follow the convention: work with the objects of the library you are calling.

Something like this (or just build the data directly with pd.Series() or pd.DataFrame):

date_df = spark.createDataFrame(date_list, 'string').toDF("date").toPandas()
# .toDF("date") only renames the column; .toPandas() converts the Spark DataFrame to pandas
integer_df = spark.createDataFrame(random_list, 'integer').toDF("value").toPandas()

After that, pd.concat([...]) will work on the resulting pandas DataFrames.
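For instance, a minimal sketch of the column-wise concatenation in plain pandas (using small stand-in lists instead of the full 61-day range from your code):

```python
import pandas as pd

# Stand-in data; in your code these come from pd.date_range and np.random.normal
date_df = pd.DataFrame({"date": ["2024-01-01", "2024-01-02", "2024-01-03"]})
integer_df = pd.DataFrame({"value": [48, 52, 50]})

# axis=1 concatenates column-wise, aligning rows by index
test_df = pd.concat([integer_df, date_df], axis=1)
print(test_df.columns.tolist())  # ['value', 'date']
```

Because both frames share the same default integer index, the rows line up one-to-one, which is exactly the pairing you want.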

Your second error was raised because that concat only accepts Series and DataFrame objects of its own library; since you passed a PySpark DataFrame, it couldn't recognise the type and reported it as an invalid (list-like) object.

So again, the fix is to use the object type the library expects, or convert to NumPy arrays if you want something more lightweight.
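In fact, since you already have both Python lists before creating any Spark DataFrame, one option (an alternative approach, not part of your original code) is to skip the Spark round-trip entirely and build the combined frame in one step:

```python
import numpy as np
import pandas as pd

# Same construction as your code, shortened to 5 days for illustration
date_list = pd.date_range(start="2024-01-01", periods=5).strftime("%Y-%m-%d").tolist()
random_list = [round(i) for i in np.random.normal(loc=50, scale=10, size=5)]

# Both columns go into a single pandas DataFrame; no concat needed
combined = pd.DataFrame({"date": date_list, "value": random_list})
# spark_df = spark.createDataFrame(combined)  # only if you need a Spark DataFrame afterwards
```

This avoids the whole class of mixed-object errors, because everything stays a pandas object until you explicitly hand it to Spark.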

Hope this helps.
