I have been trying to setup an athena database for sometime, I seem to have the database setup correctly but when I query it returns no data. The data I am querying is a series of partitioned S3 files in the structure of
"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file.json"
there can be multiple file.json in one partition e.g.
"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file1.json"
"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file2.json"
Below are the queries I am running along with the create command and the data stored
CREATE EXTERNAL TABLE bench_logs (
id string,
filename string,
data struct<transmit_start: timestamp,
transmit_end:timestamp,
transfer_start:timestamp,
transfer_end:timestamp,
processing_start:timestamp,
processing_end:timestamp>
)
PARTITIONED BY (
partition_1 string,
partition_2 string,
partition_3 date,
partition_4 string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION
's3://benchmark-files/complete/'
TBLPROPERTIES (
'classification'='json',
'storage.location.template'='s3://iceqube-benchmark-files/complete/partition_1=${partition_1}/partition_2=${partition_2}/partition_3=${partition_3}/partition_4=${partition_4}/')
that table is being queried like:
SELECT id FROM "benchmark"."bench_logs"
WHERE partition_1='foo'
AND partition_2='bar'
AND partition_3=cast('1970-01-01' as date)
AND partition_4='09:30:00';
the query says it ran correctly but i see no data other than column headers.
if anymore data is needed ill provide it, i have been stuck a few days now and cant get my head around it at all. thanks in advance.
CodePudding user response:
Before you can query a partitioned table you must add the partitions to it. This can be done with ALTER TABLE bench_logs ADD PARTITION …
, or by using partition projection, as well as other ways.
Also, you seem to have mixed up the keys and values of your Hive partition scheme: if a partition key is called partition_1
the S3 URI should be …/partition_1=data_1/…
, not …/data_1=partition_1/…
.
CodePudding user response:
Coming Back incase anyone else has these issues.
I followed Theo's advice and still had problems querying.
turns out the problem was having a partition with a value containing ":" as this is a restricted character in S3 normally but as i was writing programatically it was allowed through.
a full explanation of this is answered better here