I am new to Hive, and at my current job I saw an approach where a separate table is created for each date, e.g. TableA_20220701, TableA_20220702, and so on.
I am wondering whether this is any better than creating a single table that is partitioned by date.
Answer:
Preferable: one table partitioned on a date column.
One table partitioned on date_column -
Pros -
- A single, easily managed object. Each partition behaves much like a separate table and is stored in its own folder, so you can create or drop partitions easily.
- Your application/tool doesn't need to know the exact partition name, so for apps it is just one object.
- You can query the whole dataset, and a filter on the partition column lets Hive read only the matching partitions.
Cons - You need an extra partition column (partition_col), which appears at the end of the table's column list.
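To make the partition handling concrete, here is a minimal sketch in Python that builds the HiveQL statements involved. The table name `table_a`, its columns, the partition column `dt`, and the ORC storage format are all assumptions made for illustration; they do not come from the question.

```python
from datetime import date

# Hypothetical table layout: the data columns and the partition column `dt`
# are illustrative assumptions, not taken from the original question.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS table_a (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
""".strip()


def add_partition(table: str, day: date) -> str:
    """Build the HiveQL statement that registers one daily partition."""
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (dt='{day.isoformat()}')"


def drop_partition(table: str, day: date) -> str:
    """Build the HiveQL statement that removes one daily partition."""
    return f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{day.isoformat()}')"


def range_query(table: str, start: date, end: date) -> str:
    """Build a query over a date range; the table name stays the same for any range."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE dt BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
    )


if __name__ == "__main__":
    print(CREATE_TABLE)
    print(add_partition("table_a", date(2022, 7, 1)))
    print(range_query("table_a", date(2022, 7, 1), date(2022, 7, 31)))
```

Whatever date range you ask for, the application only ever refers to `table_a`; the dates show up as values in the WHERE clause rather than as part of an object name.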
Multiple tables -
Pros -
- You don't need an extra column.
Cons -
- You lose all the pros of option 1.
- With one table per day, you will end up with hundreds of tables after a year of runs, which is very difficult to manage.
- If you want to query all the data, you have to UNION ALL the tables together in one query, which becomes tedious over time.
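As a rough illustration of that last point, the sketch below generates the query you would need with one table per day. The helper names are hypothetical; the `TableA_YYYYMMDD` naming follows the example in the question.

```python
from datetime import date, timedelta


def daily_table_names(prefix: str, start: date, end: date) -> list[str]:
    """List the per-day table names (prefix_YYYYMMDD) for an inclusive date range."""
    days = (end - start).days + 1
    return [
        f"{prefix}_{(start + timedelta(days=i)).strftime('%Y%m%d')}"
        for i in range(days)
    ]


def union_all_query(prefix: str, start: date, end: date) -> str:
    """Build the query that stitches every per-day table together with UNION ALL."""
    selects = [f"SELECT * FROM {name}" for name in daily_table_names(prefix, start, end)]
    return "\nUNION ALL\n".join(selects)


if __name__ == "__main__":
    # Three days already need three SELECTs; a full year needs one per day.
    print(union_all_query("TableA", date(2022, 7, 1), date(2022, 7, 3)))
    print(len(daily_table_names("TableA", date(2022, 7, 1), date(2023, 6, 30))))
```

Every consumer has to regenerate this statement as new tables appear, whereas with a partitioned table the query text stays the same and only the date filter changes.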