PySpark: Converting a sample to a pandas DataFrame

I am trying to extract a sample from a Spark DataFrame (df_spark) with 100 million rows and convert it to a pandas DataFrame using the code below:

df = df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11).collect().toPandas()

Unfortunately, I'm getting the following error:

AttributeError: 'list' object has no attribute 'toPandas'

I also tried converting it to an RDD and then to pandas, and got the same error.

Once I have the sample as a list, what is the correct way to convert it to a pandas DataFrame or a Spark DataFrame?

CodePudding user response：

I solved this by first converting the sample to an RDD, then to a Spark DataFrame, and finally to pandas, as in the code below:

df = (df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11)
              .rdd
              .toDF()
              .toPandas())

CodePudding user response：

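There is no need to call collect() here. collect() returns a plain Python list of Row objects, and a list has no toPandas() method, which is where the AttributeError comes from. sample() already returns a DataFrame, so the code can be as simple as: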

df = df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11).toPandas()
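
If you do this often, the one-liner can be wrapped in a small helper. This is only a sketch: sample_to_pandas and its max_rows parameter are hypothetical names for illustration, not part of PySpark. It assumes df_spark is a pyspark.sql.DataFrame and relies only on its sample(), limit() and toPandas() methods. Keep in mind that toPandas() pulls every sampled row into driver memory, and that sample() only approximates the requested fraction, so an optional row cap can be a useful safeguard.

```python
def sample_to_pandas(df_spark, fraction, seed=None, max_rows=None):
    """Sample a Spark DataFrame and return the sample as a pandas DataFrame.

    Hypothetical helper: df_spark is assumed to be a pyspark.sql.DataFrame.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    # sample() is lazy and returns another Spark DataFrame, not a list,
    # so no collect() is needed. PySpark expects fraction to be a float.
    sampled = df_spark.sample(
        withReplacement=False, fraction=float(fraction), seed=seed
    )

    # Optional safety cap, because toPandas() loads all rows on the driver.
    if max_rows is not None:
        sampled = sampled.limit(max_rows)

    return sampled.toPandas()


# Usage, assuming the df_spark from the question already exists:
# df = sample_to_pandas(df_spark, fraction=0.05, seed=11)
```

With 100 million rows and fraction = 0.05, the result will be roughly 5 million rows, so it is worth checking that the driver has enough memory before calling it.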