Is there any way to count how many partitions reached in Spark query on Hadoop?

I want to stop a Spark query if it takes more than 10 minutes.

But that threshold is for just one partition.

I mean, if the query touches 2 partitions in Hadoop, the threshold should be 20 minutes.

For example, for this query I need a 10-minute threshold:

SELECT Max(col1),
       Min(col2)
FROM   my_partitioned_table_on_hadoop
WHERE  partitioned_column = 1 

And for this one I need a 20-minute threshold:

SELECT Max(col1),
       Min(col2)
FROM   my_partitioned_table_on_hadoop
WHERE  partitioned_column IN ( 1, 2 )

Is this possible?

CodePudding user response:

No. There is no such support in Spark.

CodePudding user response:

The answer to the question in your title ("Is there any way to count how many partitions...") is "yes" if your data is stored as Parquet. You can run explain() on your query and see how many partitions will be scanned during query execution. For example:

scala> spark.sql("select * from tab where p > '1' and p <'4'").explain()
== Physical Plan ==
*(1) FileScan parquet default.tab[id#375,desc#376,p#377] Batched: true, Format: Parquet, 
     Location: PrunedInMemoryFileIndex[hdfs://ns1/user/hive/warehouse/tab/p=2, hdfs://ns1/user/hive/warehouse..., 
     PartitionCount: 2, PartitionFilters: [isnotnull(p#377), (p#377 > 1), (p#377 < 4)], 
     PushedFilters: [], ReadSchema: struct<id:int,desc:string>

...from which PartitionCount: x can be parsed quite easily.
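A minimal sketch of that parsing in Scala, assuming a SparkSession named spark and the same tab table as above; the exact plan wording can differ between Spark versions, so treat the PartitionCount regex as an assumption rather than a stable interface:

// Build the query; nothing is executed yet.
val df = spark.sql("select * from tab where p > '1' and p < '4'")

// The physical plan string is the same text that explain() prints.
val plan = df.queryExecution.executedPlan.toString

// Extract "PartitionCount: N" from the scan node's metadata
// (assumption: the plan text contains this field, as in the output above).
val partitionCount = """PartitionCount: (\d+)""".r
  .findFirstMatchIn(plan)
  .map(_.group(1).toInt)
  .getOrElse(0)

println(s"Partitions to be scanned: $partitionCount")

The resulting count could then be multiplied by your 10-minute budget to derive the overall threshold; enforcing that threshold, however, is still up to the caller, not Spark.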

The second question (which is technically a statement -- "I want to stop a Spark query if the query takes more than 10 mins") is a "no", just as @thebluephantom said.
