How to add a column name to the DataFrame storing the result of correlation of two columns in PySpark?

Time: 12-26

I have read a CSV file and need to find the correlation between two columns.

I am using df.stat.corr('Age', 'Exp') and the result is 0.7924058156930612. But I want this result stored in another DataFrame with the header "correlation":

correlation

0.7924058156930612

CodePudding user response:

Following up on what @gupta_hemant commented: df.stat.corr returns a plain Python float, not a DataFrame, so there is nothing to collect(). You can attach the value to df as a literal column with F.lit:

import pyspark.sql.functions as F

df.withColumn("correlation", F.lit(df.stat.corr("Age", "Exp")))

Note that this adds the same constant value to every row of df.

CodePudding user response:

Try this and let me know. Note the trailing comma: spark.createDataFrame expects a list of rows, so the single value must be wrapped in a one-field tuple.

corrValue = df.stat.corr("Age", "Exp")
newDF = spark.createDataFrame(
    [(corrValue,)],   # (corrValue) is just a parenthesized float; (corrValue,) is a tuple
    ["corr"]
)

CodePudding user response:

@gupta_hemant this is my code, but when I call df_sol.show() it throws an error

from pyspark.sql import Row
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

rdd = spark.sparkContext.parallelize([
    Row(Stats='Co-variance', Value=df.stat.cov('rand1', 'rand2')),
    Row(Stats='Correlation', Value=df.stat.corr('rand1', 'rand2'))])

schema = StructType([
    StructField("Stats", StringType(), True),
    StructField("Value", DoubleType(), True)])

df_sol = spark.createDataFrame(rdd, schema)