I'm trying to extract a sample from a Spark dataframe (df_spark) with 100 million rows and convert it to a pandas dataframe using the code below:
df = df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11).collect().toPandas()
Unfortunately, I'm getting the following error:
AttributeError: 'list' object has no attribute 'toPandas'
I also tried converting it to an RDD and then to pandas, and got the same error.
Once I have the sample list, what is the correct way to convert it to a pandas dataframe or a Spark dataframe?
CodePudding user response:
I solved this by first converting the sample to an RDD, then to a Spark DataFrame, and finally to pandas, as in the code below:
df = (df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11)
.rdd
.toDF()
.toPandas())
CodePudding user response:
There is no need to call collect() here. collect() returns a plain Python list of Row objects, which is why the error mentions 'list'. The sample() function returns a DataFrame object, so the code can be as simple as:
df = df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11).toPandas()
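If you already have a collected list of rows and still want a dataframe back, something along the lines of the sketch below may help. The helper names sample_to_pandas and rows_to_spark_df are made up for illustration, and `spark` is assumed to be an active SparkSession; only sample(), toPandas() and createDataFrame() are actual PySpark calls.

```python
def sample_to_pandas(df_spark, fraction=0.05, seed=11):
    """Sample a Spark DataFrame and convert the result to pandas.

    sample() returns another Spark DataFrame, so toPandas() can be
    called on it directly, with no collect() in between.
    """
    sampled = df_spark.sample(withReplacement=False, fraction=fraction, seed=seed)
    return sampled.toPandas()


def rows_to_spark_df(spark, rows, schema):
    """Rebuild a Spark DataFrame from an already collected list of Row objects.

    Assumptions: `spark` is an active SparkSession and `schema` is the
    schema of the DataFrame the rows came from (for example df_spark.schema).
    """
    return spark.createDataFrame(rows, schema=schema)
```

Usage would look like df = sample_to_pandas(df_spark), or df_again = rows_to_spark_df(spark, rows, df_spark.schema) for a list obtained from collect(). Keep in mind that toPandas() brings every sampled row to the driver, and that the fraction passed to sample() is approximate, so the result will not be exactly 5% of the rows.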