How to call an external application or a shared-library function from Spark

I built a Spark standalone cluster from several of my own machines and use it for machine learning: sample generation, feature extraction, training, and prediction.
In the first step, a batch of samples is generated by running algorithm A over the original files. Algorithm A is written in C (available both as an executable and as a .so library).
My questions:
(1) What is a workable solution for this requirement?
(2) If I create an RDD for each file and use the library function as the mapping function in map, can that work?
Thank you very much!

CodePudding user response:

Thanks for the invitation, but I can't follow this; I don't use Spark.

CodePudding user response:

What are the input and output of your external C program?
I am facing a similar problem right now; my approach is to do the processing inside RDD.mapPartitions.

CodePudding user response:

Replying to someone1999 (2nd floor):

My C program does feature extraction on a number of audio files. A single input is the path of one audio file, and the output is a set of feature values (written to a TXT file).
My approach is to build the RDD from the list of audio file names rather than from the audio data itself, so each RDD element is a single file name.
That is because my C program processes a complete audio file; if the RDD were built from the file data, I don't know how the function could be made to work...

CodePudding user response:

Call the C program inside map. The argument is the row passed to map, which here is the file name. Wait for the C program to finish, then process the data in its output file however you need.

The key point is to deploy the C program on all the workers in advance, at the same path on each.

I can't verify this myself, so treat it as something to try.
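
A minimal Python sketch of the approach in this reply, under one loud assumption: the C program is invoked with an input path and an output path as its two arguments and writes whitespace-separated feature values as text. The thread does not state the real command line, and `extract_partition` and `command` are hypothetical names. On a cluster the function would be handed to `rdd.mapPartitions`, and the executable has to exist at the same path on every worker.

```python
import os
import subprocess
import tempfile


def extract_partition(paths, command):
    """Run an external program once per file name in a partition.

    Assumption: the program is started as command + [input_path, output_path]
    and writes its feature values as text to output_path.
    """
    for path in paths:
        # A private output file per input, so concurrent tasks cannot collide.
        fd, out_path = tempfile.mkstemp(suffix=".txt")
        os.close(fd)
        try:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(command + [path, out_path], check=True)
            with open(out_path) as fh:
                features = fh.read().split()
        finally:
            os.remove(out_path)
        yield path, features


# With PySpark this could be wired up roughly as follows, where
# "/opt/tools/extract" is a made-up path that must exist on every worker:
#   names = sc.parallelize(file_names)
#   pairs = names.mapPartitions(
#       lambda it: extract_partition(it, ["/opt/tools/extract"]))
```

Because the function only receives file names, the audio files themselves still have to be readable at those paths on whichever worker runs the task.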

CodePudding user response:

Set of file names -> RDD of file names -> mapToPair (call your C function through JNI) -> PairRDD of (file name, feature set)
The key is whether your C function can access HDFS; otherwise the audio files have to be distributed to the same path on every worker node.
Alternatively, modify the C function so that it receives the file content (i.e., a byte array) rather than the file path. The flow then becomes:
Set of file names -> RDD of file names -> mapToPair -> RDD of (file name, file content byte array) -> map (call your C function through JNI) -> PairRDD of (file name, feature set)
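
The second flow can be sketched in Python as below. This is an illustration rather than the JNI version the reply describes: the native call is left as a hypothetical `native_extract` callable (bytes in, list of feature values out), which on a real cluster would wrap whatever binding to the C library is available. `read_pair`, `extract_pair`, and `run_locally` are made-up names, and `run_locally` only imitates the RDD chain with plain Python so the steps can be tried without a cluster.

```python
def read_pair(path):
    """Step: file name -> (file name, file content as bytes)."""
    with open(path, "rb") as fh:
        return path, fh.read()


def extract_pair(pair, native_extract):
    """Step: (file name, bytes) -> (file name, feature set).

    native_extract is a hypothetical callable standing in for the C function.
    """
    name, content = pair
    return name, native_extract(content)


def run_locally(paths, native_extract):
    """Local stand-in for the RDD chain, using plain Python iteration."""
    pairs = map(read_pair, paths)
    return [extract_pair(pair, native_extract) for pair in pairs]


# Rough PySpark equivalents of the two steps:
#   pairs = sc.parallelize(paths).map(read_pair)   # paths readable on workers
#   feats = pairs.map(lambda p: extract_pair(p, native_extract))
# When the files live in HDFS, sc.binaryFiles(directory) already yields
# (file name, bytes) pairs, so the first step is not needed.
```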

CodePudding user response:

Replying to link0007 (5th floor):

Right, whether the C function can access HDFS is the important part. But the C function is a .so library that I call directly, and I have not written an interface that lets it access HDFS. Instead I use a clumsy approach: during map, the files matching the file names assigned to a node are transferred to it from the master node...
This feels like a failure: it makes no use of Spark's features and only uses the RDD to distribute the work...

CodePudding user response:

Replying to qq_22717679 (6th floor):

Actually, you can upload the audio files to HDFS and have the executor fetch each file into a local temporary directory at run time for the C code to read. Your current approach saturates the master node's network I/O, so performance is very poor. Better still, make the C function accept a byte array as its parameter, or let it access HDFS directly.
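
A hedged sketch of the "fetch to a local temporary directory" idea, in Python. It assumes the `hdfs` command-line client is on the PATH of every executor; `hdfs_get`, `with_local_copy`, and the `process` callable (which would run the C program on the local path) are hypothetical names, not part of Spark.

```python
import os
import shutil
import subprocess
import tempfile


def hdfs_get(remote_path, local_path):
    """Copy one file out of HDFS using the `hdfs dfs -get` command."""
    subprocess.run(["hdfs", "dfs", "-get", remote_path, local_path], check=True)


def with_local_copy(remote_path, process, fetch=hdfs_get):
    """Fetch remote_path into a temp dir, call process(local_path), clean up.

    Returns (remote_path, result of process).
    """
    tmp_dir = tempfile.mkdtemp(prefix="audio_")
    try:
        local_path = os.path.join(tmp_dir, os.path.basename(remote_path))
        fetch(remote_path, local_path)
        return remote_path, process(local_path)
    finally:
        # Remove the staged copy even if fetch or process raised.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Inside a job this could be used as `names.map(lambda p: with_local_copy(p, process))`, so each executor pulls its own files from HDFS instead of having the master node push them.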