I have s3 bucket as below,
s3://rawzone/toyoto/2020-11-04/
inside each date I have 4 files
s3://rawzone/toyoto/2020-11-04/Company.json
s3://rawzone/toyoto/2020-11-04/sales.json
s3://rawzone/toyoto/2020-11-04/transport.json
s3://rawzone/toyoto/2020-11-04/preaquisitions.json
and I have so many dates like this
s3://rawzone/toyoto/2020-11-05/Company.json
s3://rawzone/toyoto/2020-11-05/sales.json
s3://rawzone/toyoto/2020-11-05/transport.json
s3://rawzone/toyoto/2020-11-05/preaquisitions.json
I need to create a table for across all dates for Company,sales,transport,preaquisitions
How can I retrieve this data from s3 bucket for each table?
So far tried (kept ??? where I don't know how to replace it)
create or replace external table toyoto.PB_compnay
(
columns mapping......
)
partition by ???
with location = @raw_zone/pitchbook/??????
file_format = json_format
aws_sns_topic='arn:aws:sns:us-west-1:5438:dev-autore'
auto_refresh = true
CodePudding user response:
The practice is to partition the data files based on increments of time; or, if the data files are staged from multiple sources, to partition by a data source identifier and date or timestamp.
If you prefer to have 4 tables based on the filenames, I would recommend having a pattern that would pick up all the files that have that name.
create or replace external table toyoto.PB_compnay
(
columns mapping......
)
with location = @stage/path/
file_format = json_format
pattern='.*/.*/Company[.]json'
aws_sns_topic='arn:aws:sns:us-xxx:xxx:xxxx'
auto_refresh = true