Home > Mobile >  partitioned json queries in athena return no data
partitioned json queries in athena return no data

Time:11-29

I have been trying to setup an athena database for sometime, I seem to have the database setup correctly but when I query it returns no data. The data I am querying is a series of partitioned S3 files in the structure of

"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file.json"

there can be multiple file.json in one partition e.g.

"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file1.json"
"S3://bucket_name/data1=partition_1/data2=partition_2/data3=partition_3/data4=partition_4/file2.json"

Below are the queries I am running along with the create command and the data stored

CREATE EXTERNAL TABLE bench_logs (
  id string,
  filename string,
  data struct<transmit_start: timestamp, 
                     transmit_end:timestamp, 
                     transfer_start:timestamp,
                     transfer_end:timestamp,
                     processing_start:timestamp, 
                     processing_end:timestamp>
  )
PARTITIONED BY ( 
  partition_1 string, 
  partition_2 string, 
  partition_3 date, 
  partition_4 string
  )
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION
  's3://benchmark-files/complete/'
TBLPROPERTIES (
  'classification'='json',
  'storage.location.template'='s3://iceqube-benchmark-files/complete/partition_1=${partition_1}/partition_2=${partition_2}/partition_3=${partition_3}/partition_4=${partition_4}/')

that table is being queried like:

SELECT id FROM "benchmark"."bench_logs"
WHERE partition_1='foo'
AND partition_2='bar' 
AND partition_3=cast('1970-01-01' as date) 
AND partition_4='09:30:00';

the query says it ran correctly but i see no data other than column headers.

if anymore data is needed ill provide it, i have been stuck a few days now and cant get my head around it at all. thanks in advance.

CodePudding user response:

Before you can query a partitioned table you must add the partitions to it. This can be done with ALTER TABLE bench_logs ADD PARTITION …, or by using partition projection, as well as other ways.

Also, you seem to have mixed up the keys and values of your Hive partition scheme: if a partition key is called partition_1 the S3 URI should be …/partition_1=data_1/…, not …/data_1=partition_1/….

CodePudding user response:

Coming Back incase anyone else has these issues.

I followed Theo's advice and still had problems querying.

turns out the problem was having a partition with a value containing ":" as this is a restricted character in S3 normally but as i was writing programatically it was allowed through.

a full explanation of this is answered better here

  • Related