I want to stop a Spark query if it takes more than 10 minutes, but that limit is per partition.
That is, the threshold should scale with the number of Hadoop partitions the query touches: if the query hits 2 partitions, the limit becomes 20 minutes.
For example, this query should get a 10-minute threshold:
SELECT Max(col1),
       Min(col2)
FROM   my_partitioned_table_on_hadoop
WHERE  partitioned_column = 1
And this one should get a 20-minute threshold:
SELECT Max(col1),
       Min(col2)
FROM   my_partitioned_table_on_hadoop
WHERE  partitioned_column IN ( 1, 2 )
Is this possible?
CodePudding user response:
No. There is no such support in Spark.
CodePudding user response:
The answer to the question in your title ("Is there any way to count how many partitions...") is "yes" if your data is stored as Parquet. You can run explain() on your query and see how many partitions will be scanned during query execution. For example:
scala> spark.sql("select * from tab where p > '1' and p <'4'").explain()
== Physical Plan ==
*(1) FileScan parquet default.tab[id#375,desc#376,p#377] Batched: true, Format: Parquet,
Location: PrunedInMemoryFileIndex[hdfs://ns1/user/hive/warehouse/tab/p=2, hdfs://ns1/user/hive/warehouse...,
PartitionCount: 2, PartitionFilters: [isnotnull(p#377), (p#377 > 1), (p#377 < 4)],
PushedFilters: [], ReadSchema: struct<id:int,desc:string>
...from which PartitionCount: x
can be parsed quite easily.
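If you want that number programmatically rather than by eyeballing the output, here is a rough sketch (my own addition, not part of the answer above): take the physical plan as a string and regex out the PartitionCount entry. It assumes a Parquet-backed table, so that the FileScan node reports PartitionCount as in the plan shown above.

// Sketch: pull PartitionCount out of the physical plan string.
// Assumes the FileScan metadata contains "PartitionCount: <n>" as shown above.
val df = spark.sql("select * from tab where p > '1' and p < '4'")
val planString = df.queryExecution.executedPlan.toString()

val partitionCount: Option[Int] =
  """PartitionCount: (\d+)""".r
    .findFirstMatchIn(planString)
    .map(_.group(1).toInt)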
The second question (which is technically a statement -- "I want to stop a Spark query if the query takes more than 10 mins") is a "no", just as @thebluephantom said.
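That said, if a hard stop outside of pure SQL is acceptable, you can approximate the behaviour yourself: derive the threshold from the parsed partition count (10 minutes per partition) and cancel the query's job group when the time runs out. This is only a sketch of a workaround, not a built-in Spark feature; the group id and the 10-minutes-per-partition figure are assumptions taken from the question.

import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val query = "SELECT max(col1), min(col2) FROM my_partitioned_table_on_hadoop WHERE partitioned_column IN (1, 2)"
val partitionCount = 2                         // e.g. parsed from the plan as shown above
val threshold = (partitionCount * 10).minutes  // 10 minutes per partition, per the question

val result = Future {
  // Set the job group on the thread that actually submits the Spark jobs,
  // so cancelJobGroup below can find and kill them.
  spark.sparkContext.setJobGroup("bounded-query", query, interruptOnCancel = true)
  spark.sql(query).collect()
}

try {
  Await.result(result, threshold)
} catch {
  case _: TimeoutException =>
    spark.sparkContext.cancelJobGroup("bounded-query")
    throw new RuntimeException(s"Query exceeded $threshold and was cancelled")
}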