Spark machine learning repository of the data types -- the scala version

Time:09-20

1. Local vector
The base class of local vectors is Vector. Two implementations are provided: DenseVector and SparseVector. We recommend creating local vectors through the factory methods implemented in Vectors. (Note: Scala imports scala.collection.immutable.Vector by default; to use MLlib's Vector, you must explicitly import org.apache.spark.mllib.linalg.Vector.)
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to the nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
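To make the sparse representation concrete, here is a small plain-Scala sketch (no Spark required) of how a (size, indices, values) triple expands back into dense form. The helper toDense is hypothetical, written for illustration only; it is not part of the MLlib API:

```scala
// Expand a sparse (size, indices, values) representation into a dense array.
// This mirrors what SparseVector stores internally; toDense is a hypothetical
// helper for illustration, not an MLlib method.
def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)
  for ((idx, v) <- indices.zip(values)) dense(idx) = v
  dense
}

// The sparse form (3, [0, 2], [1.0, 3.0]) represents the dense vector (1.0, 0.0, 3.0).
val dense = toDense(3, Array(0, 2), Array(1.0, 3.0))
println(dense.mkString(", ")) // 1.0, 0.0, 3.0
```

When most entries are zero, the sparse form stores only the nonzero positions and values, which is why it is preferred for high-dimensional feature vectors.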


2. Labeled point (LabeledPoint)
A labeled point is represented by the case class LabeledPoint:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Create a labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))


3. Sparse data
Sparse data is very common in practice. MLlib can read training examples stored in LIBSVM format, the default format of LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector, as follows:
label index1:value1 index2:value2 ...
Indices are one-based and in ascending order. After loading, they are converted to zero-based.
Training examples stored in LIBSVM format can be read via MLUtils.loadLibSVMFile:
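The one-based to zero-based index conversion can be sketched in plain Scala (no Spark needed). The function parseLibSVMLine below is a hypothetical illustration of the format, not MLlib's actual parser (which lives behind MLUtils.loadLibSVMFile):

```scala
// Parse one LIBSVM line, e.g. "1.0 1:2.5 3:4.0", into a label plus
// zero-based (index, value) pairs. Hypothetical sketch for illustration only.
def parseLibSVMLine(line: String): (Double, Seq[(Int, Double)]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.toSeq.map { t =>
    val parts = t.split(":")
    // LIBSVM indices are one-based; convert to zero-based.
    (parts(0).toInt - 1, parts(1).toDouble)
  }
  (label, features)
}

val (label, features) = parseLibSVMLine("1.0 1:2.5 3:4.0")
// label == 1.0, features == Seq((0, 2.5), (2, 4.0))
```

Note that only the nonzero entries 1 and 3 appear on the line; after conversion they become indices 0 and 2 of a sparse vector.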
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")


4. Local matrix
A local matrix has integer row and column indices and double-typed values, and is stored on a single machine. MLlib supports dense matrices (there is no sparse matrix yet!), whose entry values are stored in a single double array in column-major order.
The base class of local matrices is Matrix; one implementation, DenseMatrix, is currently provided. We recommend creating local matrices through the factory methods implemented in Matrices:
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
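Column-major order means entry (i, j) of an m-by-n matrix sits at position i + j * m of the flat array. A plain-Scala check of the layout used above (entryAt is a hypothetical helper for illustration, not an MLlib API):

```scala
// Entry (i, j) of an m-by-n column-major matrix sits at values(i + j * numRows).
// entryAt is a hypothetical helper illustrating the layout, not an MLlib API.
def entryAt(values: Array[Double], numRows: Int, i: Int, j: Int): Double =
  values(i + j * numRows)

val values = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0) // 3 x 2, column-major

// Row 1 of ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) is (3.0, 4.0):
println(entryAt(values, 3, 1, 0)) // 3.0
println(entryAt(values, 3, 1, 1)) // 4.0
```

This is why the array passed to Matrices.dense lists the first column (1.0, 3.0, 5.0) before the second (2.0, 4.0, 6.0).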


5. Distributed matrix

A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. For a large distributed matrix, choosing the right storage format is very important: converting a distributed matrix to a different format may require a global shuffle, which is very expensive. Three types of distributed matrices have been implemented so far. The most basic type is RowMatrix. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g. a collection of feature vectors; it is backed by an RDD of its rows, where each row is a local vector. For a RowMatrix, we assume the number of columns is not huge, so that a single local vector can reasonably be communicated to the driver node and can also be stored and operated on at a single node.
An IndexedRowMatrix is similar to a RowMatrix, but it has row indices, which can be used to identify rows and execute joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, whose entries form an RDD. Note: because we cache the matrix size, the underlying RDDs of a distributed matrix must be deterministic; in general, using non-deterministic RDDs can lead to errors.

5.1 Row-oriented distributed matrix (RowMatrix)

A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g. a collection of feature vectors; it is backed by an RDD of its rows, where each row is a local vector. Since each row is a local vector, the number of columns is limited by the integer range, although in practice it should be much smaller.
A RowMatrix can be created from an RDD[Vector] instance; we can then compute its column summary statistics:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = ... // an RDD of local vectors

// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()


5.2 Row-indexed matrix (IndexedRowMatrix)
An IndexedRowMatrix is similar to a RowMatrix, but its row indices are meaningful; it is essentially an RDD of indexed rows, where each row consists of a long-typed index and a local vector. An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper (case class) over (Long, Vector). An IndexedRowMatrix can be converted to a RowMatrix by dropping its row indices:

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
val rows: RDD[IndexedRow] = ... // an RDD of indexed rows

// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()


5.3 Triplet matrix (CoordinateMatrix)
A CoordinateMatrix is a distributed matrix whose entries form an RDD. Each entry is a triple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry's value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
A CoordinateMatrix can be created from an RDD[MatrixEntry] instance, where MatrixEntry is a wrapper (case class) over (Long, Long, Double). A CoordinateMatrix can be converted to an IndexedRowMatrix (whose rows are sparse) by calling toIndexedRowMatrix; other computing operations are not supported for now:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries

// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMat = mat.toIndexedRowMatrix()
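The COO layout itself is easy to see without Spark: each triple carries its own coordinates, so a handful of entries can describe a large, mostly-zero matrix. A plain-Scala sketch, using a hypothetical Entry class as a stand-in for MLlib's MatrixEntry:

```scala
// A minimal COO representation: each entry records (row, col, value).
// Entry is a hypothetical stand-in for MLlib's MatrixEntry case class.
case class Entry(i: Long, j: Long, value: Double)

// Three stored entries are enough to describe a 1,000,000-row matrix.
val entries = Seq(Entry(0, 0, 1.0), Entry(2, 1, 3.0), Entry(999999, 7, 5.0))

// Only nonzero entries are stored; every other position is implicitly 0.0.
def lookup(entries: Seq[Entry], i: Long, j: Long): Double =
  entries.find(e => e.i == i && e.j == j).map(_.value).getOrElse(0.0)

println(lookup(entries, 2, 1)) // 3.0
println(lookup(entries, 5, 5)) // 0.0
```

This is why COO pays off only for very sparse matrices: each stored value costs two extra long indices of overhead.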