I need to create a CSV output file that lists the minimum price, maximum price, and total row count from a data set. The expected CSV output should have the column names min_price, max_price, and total_rows. How can I merge the output I have into a dataframe and then save it as CSV with these three columns?
So far I have done:
df = spark.read.parquet('mydataframe', inferSchema=True)
price = df.select('price')
max = df.agg({'price': 'max'}).show()
min = df.agg({'price': 'min'}).show()
df.count()
CodePudding user response:
This is a simple aggregation; rather than splitting it into separate actions, compute all three values in a single select on the dataframe:
from pyspark.sql import functions as F

result = (
    df
    .select(
        F.min('price').alias('min_price'),
        F.max('price').alias('max_price'),
        F.count('*').alias('total_rows')
    )
)
Save the result wherever you want. If you need it as a single CSV file, use coalesce(1); it moves all the data to one partition, which is fine for a one-row result like this one:
result.coalesce(1).write.csv("your_path", header=True)
CodePudding user response:
This answer was modified after @Benny Elgazar pointed out that my solution had a lot of issues. It is essentially his answer, but shown with a working example:
df = spark.createDataFrame(
    [(1.40,),
     (1.43,),
     (1.35,),
     (1.38,),
     (1.39,)],
    ['price'])
Then, build the aggregation result as a one-row dataframe:
from pyspark.sql import functions as F

result = (
    df
    .select(
        F.min('price').alias('min_price'),
        F.max('price').alias('max_price'),
        F.count('*').alias('total_rows')
    )
)
result.show()
+---------+---------+----------+
|min_price|max_price|total_rows|
+---------+---------+----------+
|     1.35|     1.43|         5|
+---------+---------+----------+
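As an aside, if you also need the three values back in the driver as plain Python (not only in the CSV), you can collect the single row; this is just a sketch and not required for the write:
# result has exactly one row, so first() is cheap here
row = result.first()
print(row['min_price'], row['max_price'], row['total_rows'])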
Now, write the result df to a CSV file:
result.coalesce(1).write.option("header", "true").csv("my_folder/", mode="overwrite")
You cannot choose the name of the CSV file Spark produces; it always writes a directory containing a part-*.csv file. You can, however, rename that file afterwards. Check this post for some ideas.
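For example, here is a minimal sketch of the rename, assuming the output went to the local filesystem (my_folder/ and result.csv are illustrative names; on HDFS or S3 you would use the corresponding filesystem API instead):
import glob
import shutil

# coalesce(1) produced exactly one part file inside the output folder
part_file = glob.glob("my_folder/part-*.csv")[0]
# move it to a stable, human-friendly name
shutil.move(part_file, "my_folder/result.csv")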