How to calculate a correlation matrix in Spark using scala?


In Python pandas, when I have a DataFrame df like this:

c1 c2 c3
0.1 0.3 0.5
0.2 0.4 0.6

I can use df.corr() to calculate a correlation matrix.

How do I do that in Spark with Scala?

I have read the official documentation, but the data structure used there isn't like the one above, and I don't know how to convert mine into it.

CodePudding user response:

You can solve your problem with the following code. It applies the Pearson correlation, which is also the default for the pandas function.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// Needed for .toDF; `spark` is the SparkSession (already in scope in spark-shell)
import spark.implicits._

val df = Seq(
  (0.1, 0.3, 0.5),
  (0.2, 0.4, 0.6)
).toDF("c1", "c2", "c3")

// Correlation.corr expects a single vector column, so first assemble
// the numeric columns into one vector per row
val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("vectors")

val transformed = assembler.transform(df)

// The result is a one-row DataFrame; pattern-match the Matrix out of it
val Row(corr: Matrix) = Correlation.corr(transformed, "vectors").head

println(s"Pearson correlation matrix:\n $corr")