I have read a CSV file and need to find the correlation between two columns.
I am using df.stat.corr('Age', 'Exp') and the result is 0.7924058156930612, but I want this result stored in another DataFrame with the header "correlation":
correlation
0.7924058156930612
CodePudding user response:
Following up on what @gupta_hemant commented: you can create a new column holding the value. Note that df.stat.corr returns a plain Python float, not a DataFrame, so there is nothing to collect() — wrap the scalar in F.lit instead:
import pyspark.sql.functions as F
df = df.withColumn("correlation", F.lit(df.stat.corr("Age", "Exp")))
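For reference, df.stat.corr computes the Pearson correlation coefficient and returns it as a plain float. A minimal pure-Python sketch of the same computation, using hypothetical Age/Exp values (not the asker's data), shows what that scalar represents:

```python
import math

def pearson(xs, ys):
    # Pearson r = cov(x, y) / (std(x) * std(y)); the n terms cancel,
    # so we can work with raw deviation sums
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical sample data, not from the question
age = [25, 30, 35, 40, 45]
exp = [2, 5, 9, 14, 20]
r = pearson(age, exp)  # a float in [-1, 1], like df.stat.corr
```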
CodePudding user response:
Try this and let me know:
corrValue = df.stat.corr("Age", "Exp")
newDF = spark.createDataFrame(
    [(corrValue,)],  # note the trailing comma: a one-element tuple per row
    ["correlation"]
)
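The trailing comma matters here: in Python, (corrValue) is just a parenthesized float, while (corrValue,) is a one-element tuple, which is the row shape createDataFrame expects. A quick plain-Python check (corrValue hard-coded from the question for illustration):

```python
corrValue = 0.7924058156930612

# parentheses alone do not make a tuple
not_a_row = (corrValue)   # still a float
a_row = (corrValue,)      # a 1-tuple, i.e. one row with one column

print(type(not_a_row).__name__)  # float
print(type(a_row).__name__)      # tuple
```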
CodePudding user response:
@gupta_hemant this is my code, but when I call df_sol.show() it throws an error:
from pyspark.sql import Row
from pyspark.sql.types import DoubleType, StructType, StringType, StructField
rdd = spark.sparkContext.parallelize([
    Row(Stats='Co-variance', Value=df.stat.cov('rand1', 'rand2')),
    Row(Stats='Correlation', Value=df.stat.corr('rand1', 'rand2'))
])
schema = StructType([
    StructField("Stats", StringType(), True),
    StructField("Value", DoubleType(), True)
])
df_sol = spark.createDataFrame(rdd, schema)
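As a side note, the two statistics in those rows are related: correlation is covariance normalized by the two sample standard deviations. A small pure-Python check on hypothetical rand1/rand2 values (not the asker's data), using the n-1 sample formulas:

```python
import math

def sample_cov(xs, ys):
    # sample covariance with the n-1 denominator
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(xs):
    # sample standard deviation with the n-1 denominator
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))

# hypothetical sample data, not from the question
rand1 = [0.1, 0.4, 0.35, 0.8]
rand2 = [0.2, 0.5, 0.3, 0.9]

# correlation = covariance / (std1 * std2)
corr = sample_cov(rand1, rand2) / (sample_std(rand1) * sample_std(rand2))
```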