For my data analysis I need to normalize the numeric variables; for now the character variables can simply be hashed, so the schema is mostly DoubleType with a few StringType fields.
There are more than 1000 fields. I listed out the field names and field types and generated the Spark schema as schema = StructType(Array[StructField](...)).
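Roughly like the sketch below (the field names and the split between numeric and string columns here are placeholders for the real ones):

import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}

// hypothetical field lists standing in for the real 1000+ column names
val numericFields = (0 until 1000).map(i => s"num_$i")   // columns to treat as Double
val stringFields  = Seq("id", "category")                // columns to keep as String

val schema = StructType(
  numericFields.map(name => StructField(name, DoubleType, nullable = true)) ++
  stringFields.map(name => StructField(name, StringType, nullable = true))
)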
Then I fetched the data with the older RDD-based API:

val rdd = sc.textFile("sample.txt").map(_.split(",")).map(attribute =>
  Row(attribute(0).toDouble, attribute(1).toDouble, attribute(2).toDouble, attribute(3).toDouble, attribute(4).toDouble, ...)
  // here, by array index, attribute(i) is either converted to Double or kept as a String
)
val df = spark.createDataFrame(rdd, schema)
Then I found the Row API rather awkward: it only seems to take values through its varargs constructor or a Seq.
I want to process the attribute(i) values first, put them into an array, and then build the Row from that array, but Row doesn't seem to accept it.
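Concretely, something like the sketch below is what I am attempting (the stringIdx set is made up for illustration); Row.fromSeq appears to take a Seq, and a Scala Array should convert to one implicitly:

import org.apache.spark.sql.Row

// hypothetical: indices of the columns that should stay as String (the rest become Double)
val stringIdx = Set(1000, 1001)

val rdd = sc.textFile("sample.txt").map(_.split(",")).map { attribute =>
  val values: Array[Any] = attribute.zipWithIndex.map {
    case (v, i) if stringIdx.contains(i) => v           // keep as String
    case (v, _)                          => v.toDouble  // convert to Double
  }
  Row.fromSeq(values)  // build the Row from the whole array instead of listing each field
}
val df = spark.createDataFrame(rdd, schema)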
Is this the normal way to build the DataFrame? Once I have the DF, I want to use SQL to compute the maximum and minimum of each variable and then normalize (MLlib's normalization wants the elements packed into a Vectors.dense, which feels like more trouble).
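To be concrete, the kind of per-column normalization I have in mind is roughly the following (the column name num_0 is made up):

import org.apache.spark.sql.functions.{col, min, max}

// min-max normalize one numeric column, "num_0", to [0, 1]
val stats = df.agg(min("num_0"), max("num_0")).head()
val (lo, hi) = (stats.getDouble(0), stats.getDouble(1))

val normalized = df.withColumn("num_0_norm", (col("num_0") - lo) / (hi - lo))

Extending this to all 1000+ columns by computing every min and max in a single agg call seems feasible, but I'm not sure it's the right approach.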
Any advice would be much appreciated.
http://bbs.ngacn.cc/read.php?tid=12301156
CodePudding user response:
Try this: https://github.com/databricks/spark-csv
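A minimal usage sketch, assuming the file is comma-delimited with no header row (adjust the options to match the real data):

// assuming the spark-csv package is on the classpath, e.g.
//   spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")   // assumption: sample.txt has no header row
  .schema(schema)              // reuse the StructType from the question instead of inferring types
  .load("sample.txt")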