For my data analysis I need to normalize the numeric variables; for now the character variables can simply be hashed, so the schema is mostly DoubleType with a few StringType fields.
There are more than 1000 fields. I listed out the field names and field types and generated the Spark schema as schema = StructType(Array[StructField](...)).
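Roughly like the sketch below (the field names and the split between numeric and string columns here are placeholders for the real ones):

import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}

// hypothetical field lists standing in for the real 1000+ column names
val numericFields = (0 until 1000).map(i => s"num_$i")   // columns to treat as Double
val stringFields  = Seq("id", "category")                // columns to keep as String

val schema = StructType(
  numericFields.map(name => StructField(name, DoubleType, nullable = true)) ++
  stringFields.map(name => StructField(name, StringType, nullable = true))
)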
Then I fetched the data with the older RDD-based API:

val rdd = sc.textFile("sample.txt").map(_.split(",")).map(attribute =>
  Row(attribute(0).toDouble, attribute(1).toDouble, attribute(2).toDouble, attribute(3).toDouble, attribute(4).toDouble, ...)
  // here, by array index, attribute(i) is either converted to Double or kept as a String
)
val df = spark.createDataFrame(rdd, schema)
Then I found the Row API rather awkward: it only seems to take values through its varargs constructor or a Seq.
I want to process the attribute(i) values first, put them into an array, and then build the Row from that array, but Row doesn't seem to accept it.
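Concretely, something like the sketch below is what I am attempting (the stringIdx set is made up for illustration); Row.fromSeq appears to take a Seq, and a Scala Array should convert to one implicitly:

import org.apache.spark.sql.Row

// hypothetical: indices of the columns that should stay as String (the rest become Double)
val stringIdx = Set(1000, 1001)

val rdd = sc.textFile("sample.txt").map(_.split(",")).map { attribute =>
  val values: Array[Any] = attribute.zipWithIndex.map {
    case (v, i) if stringIdx.contains(i) => v           // keep as String
    case (v, _)                          => v.toDouble  // convert to Double
  }
  Row.fromSeq(values)  // build the Row from the whole array instead of listing each field
}
val df = spark.createDataFrame(rdd, schema)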
Is this the normal way to build the DataFrame? Once I have the DF, I want to use SQL to compute the maximum and minimum of each variable and then normalize (MLlib's normalization wants the elements packed into a Vectors.dense, which feels like more trouble).
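To be concrete, the kind of per-column normalization I have in mind is roughly the following (the column name num_0 is made up):

import org.apache.spark.sql.functions.{col, min, max}

// min-max normalize one numeric column, "num_0", to [0, 1]
val stats = df.agg(min("num_0"), max("num_0")).head()
val (lo, hi) = (stats.getDouble(0), stats.getDouble(1))

val normalized = df.withColumn("num_0_norm", (col("num_0") - lo) / (hi - lo))

Extending this to all 1000+ columns by computing every min and max in a single agg call seems feasible, but I'm not sure it's the right approach.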
Any advice would be much appreciated.
http://bbs.ngacn.cc/read.php?tid=12301156
CodePudding user response:
Try this: https://github.com/databricks/spark-csv
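A minimal usage sketch, assuming the file is comma-delimited with no header row (adjust the options to match the real data):

// assuming the spark-csv package is on the classpath, e.g.
//   spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")   // assumption: sample.txt has no header row
  .schema(schema)              // reuse the StructType from the question instead of inferring types
  .load("sample.txt")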