I have a partitioned external table Hive that i have to overwrite with some records. There are a lot of dates that we need to reload and the queries are a bit heavy. What we want to know is if it is possible, in a simultaneous way load two or more different partitions at the same time? For example, 3 (or more) processes running in parallel like:
Process1
insert overwrite table_prod partition (data_date)
select * from table_old where data_date=20221110;
Process2
insert overwrite table_prod partition (data_date)
select * from table_old where data_date=20221111;
Process3
insert overwrite table_prod partition (data_date)
select * from table_old where data_date=20221112;
CodePudding user response:
Short answer is yes you can.
Real question is how - because you have to consider large volume of data.
Option 1 - yes, you can use shell script or some scheduler tool. But the query you're using is going to be slow. you can use static partitioning which is way faster.
insert overwrite table_prod partition (data_date=20221110) -- pls note i mentioned partition value.
select
col1, col2... -- exclude data_date column from select list
from table_old where data_date=20221110;
Option 2 - You can also use dynamic partition scheme to load all the partitions at once. This is perf intensive operation but you dont have to create any shell script or any other process.
insert overwrite table_prod partition (data_date)
select * from table_old