Should we always use rdd.count() instead of rdd.collect().size

Time: 05-09

rdd.collect().size first moves all of the data to the driver; if the dataset is large, this can result in an OutOfMemoryError.

So, should we always use rdd.count() instead?

Or, in other words: in what situation would people prefer rdd.collect().size?
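For illustration, here is a plain-Scala sketch (no Spark needed; the partitioned data is made up) of why the two calls cost so differently even though they return the same number:

```scala
// Hypothetical stand-in for an RDD: four "partitions" of rows,
// each of which would live on a different worker in real Spark.
val partitions: Seq[Seq[Int]] = Seq(
  Seq(1, 2, 3),
  Seq(4, 5),
  Seq(6, 7, 8, 9),
  Seq(10)
)

// collect().size: ship every row to the driver, then measure.
// The driver must hold the whole dataset in memory at once.
val allRowsOnDriver: Seq[Int] = partitions.flatten
val sizeViaCollect: Int = allRowsOnDriver.size

// count(): each worker counts its own partition locally; only
// one number per partition crosses the network to be summed.
val sizeViaCount: Long = partitions.map(_.size.toLong).sum

assert(sizeViaCollect.toLong == sizeViaCount) // identical result: 10
```

The results are always equal; the difference is purely where the data has to be materialized.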

CodePudding user response:

Assuming you're using Scala's size method on the array returned by rdd.collect(), I don't see any advantage in collecting the whole RDD just to get its number of rows.

This is the point of RDDs: to work on chunks of data in parallel so that transformations stay manageable. Usually the result is smaller than the original dataset, because the data has been transformed, filtered, or aggregated along the way.

collect usually comes at the end of data processing, and if you run an action you might also want to save the result: it may have required some expensive computation, and the collected data is presumably interesting/valuable.

CodePudding user response:

collect causes data to be processed and then fetched to the driver node.

For count you don't need:

  1. Full processing - some columns may never need to be fetched or computed, e.g. because they are not used in any filter. Columns that don't affect the count don't need to be loaded, processed, or transferred.

  2. A fetch to the driver node - each worker node can count its own rows, and the counts can be summed.

I see no reason for calling collect().size.

Just for general knowledge, there is another way to get around #2, although for this case it is redundant and won't prevent #1: rdd.mapPartitions(p => Iterator(p.size)).sum() (forEachPartition returns nothing, so mapPartitions is the operation that can emit each partition's size before summing).
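Since the real call needs a SparkContext, the per-partition counting idea can be sketched with plain Scala iterators as a stand-in (forEachPartition returns Unit, so mapPartitions is the operation that can emit each partition's size):

```scala
// Stand-in for rdd.mapPartitions(p => Iterator(p.size)).sum():
// each "partition" iterator is replaced by a single-element iterator
// holding its row count, and the driver sums those counts.
def countViaPartitions(partitions: Seq[Iterator[Int]]): Long =
  partitions
    .map(p => Iterator(p.size)) // the mapPartitions step: one count per partition
    .map(_.next().toLong)       // drain each single-element iterator
    .sum                        // the sum() step on the driver

val total = countViaPartitions(Seq(Iterator(1, 2, 3), Iterator(4, 5)))
assert(total == 5L)
```

Note that only the counts are combined; the rows themselves are never moved off their partitions.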
