How to calculate a correlation matrix in Spark using scala?


In Python pandas, when I have a DataFrame df like this:

c1 c2 c3
0.1 0.3 0.5
0.2 0.4 0.6

I can use df.corr() to calculate a correlation matrix.

How do I do that in Spark with Scala?

I have read the official documentation, but the data structure used there isn't like the one above, and I don't know how to convert mine into it.

CodePudding user response:

You can solve your problem with the following code. It applies the Pearson correlation, which is also the default for the pandas function.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// Needed for .toDF; `spark` is the SparkSession (already in scope in spark-shell)
import spark.implicits._

val df = Seq(
  (0.1, 0.3, 0.5),
  (0.2, 0.4, 0.6)
).toDF("c1", "c2", "c3")

// Correlation.corr expects a single vector column, so first assemble
// the numeric columns into one vector per row
val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("vectors")

val transformed = assembler.transform(df)

// The result is a one-row DataFrame; pattern-match the Matrix out of it
val Row(corr: Matrix) = Correlation.corr(transformed, "vectors").head

println(s"Pearson correlation matrix:\n $corr")