I want to stop a Spark query if it takes more than 10 minutes, but that limit is per partition.
That is, the threshold should scale with the number of Hadoop partitions the query touches: if the query hits 2 partitions, the limit becomes 20 minutes.
For example, this query should get a 10-minute threshold:
SELECT Max(col1),
       Min(col2)
FROM   my_partitioned_table_on_hadoop
WHERE  partitioned_column = 1
And this one should get a 20-minute threshold:
SELECT Max(col1),
       Min(col2)
FROM   my_partitioned_table_on_hadoop
WHERE  partitioned_column IN ( 1, 2 )
Is this possible?
CodePudding user response:
No. There is no such support in Spark.
CodePudding user response:
The answer to the question in your title ("Is there any way to count how many partitions...") is "yes" if your data is stored as Parquet. You can run explain() on your query and see how many partitions will be scanned during query execution. For example:
scala> spark.sql("select * from tab where p > '1' and p <'4'").explain()
== Physical Plan ==
*(1) FileScan parquet default.tab[id#375,desc#376,p#377] Batched: true, Format: Parquet,
Location: PrunedInMemoryFileIndex[hdfs://ns1/user/hive/warehouse/tab/p=2, hdfs://ns1/user/hive/warehouse...,
PartitionCount: 2, PartitionFilters: [isnotnull(p#377), (p#377 > 1), (p#377 < 4)],
PushedFilters: [], ReadSchema: struct<id:int,desc:string>
...from which PartitionCount: x
can be parsed quite easily.
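If you want that number programmatically rather than by eyeballing the output, here is a rough sketch (my own addition, not part of the answer above): take the physical plan as a string and regex out the PartitionCount entry. It assumes a Parquet-backed table, so that the FileScan node reports PartitionCount as in the plan shown above.

// Sketch: pull PartitionCount out of the physical plan string.
// Assumes the FileScan metadata contains "PartitionCount: <n>" as shown above.
val df = spark.sql("select * from tab where p > '1' and p < '4'")
val planString = df.queryExecution.executedPlan.toString()

val partitionCount: Option[Int] =
  """PartitionCount: (\d+)""".r
    .findFirstMatchIn(planString)
    .map(_.group(1).toInt)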
The second question (which is technically a statement -- "I want to stop a Spark query if the query takes more than 10 mins") is a "no", just as @thebluephantom said.
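That said, if a hard stop outside of pure SQL is acceptable, you can approximate the behaviour yourself: derive the threshold from the parsed partition count (10 minutes per partition) and cancel the query's job group when the time runs out. This is only a sketch of a workaround, not a built-in Spark feature; the group id and the 10-minutes-per-partition figure are assumptions taken from the question.

import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val query = "SELECT max(col1), min(col2) FROM my_partitioned_table_on_hadoop WHERE partitioned_column IN (1, 2)"
val partitionCount = 2                         // e.g. parsed from the plan as shown above
val threshold = (partitionCount * 10).minutes  // 10 minutes per partition, per the question

val result = Future {
  // Set the job group on the thread that actually submits the Spark jobs,
  // so cancelJobGroup below can find and kill them.
  spark.sparkContext.setJobGroup("bounded-query", query, interruptOnCancel = true)
  spark.sql(query).collect()
}

try {
  Await.result(result, threshold)
} catch {
  case _: TimeoutException =>
    spark.sparkContext.cancelJobGroup("bounded-query")
    throw new RuntimeException(s"Query exceeded $threshold and was cancelled")
}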