How to add a column name to the DataFrame storing the result of correlation of two columns in PySpark?

Time: 12-26

I have read a CSV file and need to find the correlation between two columns.

I am using df.stat.corr('Age', 'Exp') and the result is 0.7924058156930612. But I want this result stored in another DataFrame with the header "correlation":

correlation

0.7924058156930612

CodePudding user response:

Following up on what @gupta_hemant commented: df.stat.corr returns a plain Python float, not a DataFrame, so there is nothing to collect(). You can attach the value to df as a literal column with F.lit:

import pyspark.sql.functions as F

df.withColumn("correlation", F.lit(df.stat.corr("Age", "Exp")))

Note that this adds the same constant value to every row of df.

CodePudding user response:

Try this and let me know. Note the trailing comma: spark.createDataFrame expects a list of rows, so the single value must be wrapped in a one-field tuple.

corrValue = df.stat.corr("Age", "Exp")
newDF = spark.createDataFrame(
    [(corrValue,)],   # (corrValue) is just a parenthesized float; (corrValue,) is a tuple
    ["corr"]
)

CodePudding user response:

@gupta_hemant this is my code, but when I call df_sol.show() it throws an error

from pyspark.sql import Row
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

rdd = spark.sparkContext.parallelize([
    Row(Stats='Co-variance', Value=df.stat.cov('rand1', 'rand2')),
    Row(Stats='Correlation', Value=df.stat.corr('rand1', 'rand2'))])

schema = StructType([
    StructField("Stats", StringType(), True),
    StructField("Value", DoubleType(), True)])

df_sol = spark.createDataFrame(rdd, schema)