How to get all data in an RDD pipeline in Spark?


After loading all the data I needed and applying my mappings, I used the following code to extract n elements from the RDD with the take() function:

print(data.take(10))

But if I want to take all of the data (it could be thousands of rows or more), what code should I write to extract everything?

Thank you in advance.

CodePudding user response:

take() accepts only an Int. It should only be used when the resulting array is expected to be small, because all of the returned data is loaded into the driver's memory. If you pass a number beyond the Int range, you get an error (in Scala, a compile error):

"error: type mismatch; found: Long required: Int"

So take() is not useful for millions of rows; it is appropriate only when the result is small, in other words when the count fits in an Int and the data fits on the driver.

You can use other actions, such as collect(), to get all of the data in the RDD. Note that collect() also loads the entire result into the driver's memory, so the driver must have enough room to hold the full dataset.
