In Python pandas, when I have a dataframe df like this:
| c1  | c2  | c3  |
|-----|-----|-----|
| 0.1 | 0.3 | 0.5 |
| 0.2 | 0.4 | 0.6 |
I can use df.corr() to calculate a correlation matrix.
How do I do that in Spark with Scala?
I have read the official documentation, but the data structure there isn't like the one above, and I don't know how to convert my dataframe.
CodePudding user response:
You can solve this with the following code. It applies the Pearson correlation, which is also the default method of the pandas function.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.stat.Correlation

// Assumes an active SparkSession named `spark`; needed for .toDF
import spark.implicits._

// Recreate the example dataframe
val df = Seq(
  (0.1, 0.3, 0.5),
  (0.2, 0.4, 0.6)
).toDF("c1", "c2", "c3")

// Correlation.corr expects a single vector column, so assemble c1..c3 into one
val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("vectors")

val transformed = assembler.transform(df)

// Returns a one-row DataFrame whose only cell holds the correlation matrix
val corr = Correlation.corr(transformed, "vectors").head
println(s"Pearson correlation matrix:\n $corr")
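If you want the matrix itself rather than the printed Row, you can pattern-match it out of the result. A minimal sketch, assuming the transformed dataframe from above; Spark also accepts a method name as a third argument to Correlation.corr, so Spearman rank correlation is shown for comparison:

import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.sql.Row

// Extract the Matrix from the single-row result
val Row(pearson: Matrix) = Correlation.corr(transformed, "vectors").head
println(s"Pearson correlation matrix:\n $pearson")

// Spearman rank correlation via the optional method argument
val Row(spearman: Matrix) = Correlation.corr(transformed, "vectors", "spearman").head
println(s"Spearman correlation matrix:\n $spearman")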