I have some data in an S3 location in JSON format. It has 4 columns: val, time__stamp, name and type. I would like to create an external Athena table from this data with the transformations given below:
- timestamp: the timestamp should be converted from unix epoch to UTC; this I did by using the timestamp data type.
- name: name should be filtered with the following SQL logic: name not in ('abc','cdf','fgh') and name not like '%operator%'
- type: type should not have values labeled as counter.
- I would like to add two partition columns, date and hour, which should be derived from the time__stamp column.
I started with the following:
CREATE EXTERNAL TABLE `airflow_cluster_data`(
`val` string COMMENT 'from deserializer',
`time__stamp` timestamp COMMENT 'from deserializer',
`name` string COMMENT 'from deserializer',
`type` string COMMENT 'from deserializer')
PARTITIONED BY (
date,
hour)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.time_stamp'='@timestamp')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket1/raw/airflow_data'
I tried various things but couldn't figure out the syntax. Using Spark would have been easier, but I don't want to run Amazon EMR every hour for a small data set. I prefer to do it in Athena if possible.
Please have a look at some sample data:
1533,1636674330000,abc,counter
1533,1636674330000,xyz,timer
1,1636674330000,cde,counter
41,1636674330000,cde,timer
1,1636674330000,fgh,counter
231,1636674330000,xyz,timer
1,1636674330000,abc,counter
2431,1636674330000,cde,counter
42,1636674330000,efg,timer
CodePudding user response:
You can create your own UDF for the transformation and use it in Athena: https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html
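An Athena UDF runs in a Lambda function that you implement (with the Athena Query Federation SDK) and invoke through the USING EXTERNAL FUNCTION clause. A minimal sketch of the query side, assuming time__stamp holds epoch milliseconds as a bigint; the function name to_utc_timestamp and the Lambda name my-udf-lambda are placeholders, not existing resources:
USING EXTERNAL FUNCTION to_utc_timestamp(epoch_millis BIGINT) RETURNS TIMESTAMP
LAMBDA 'my-udf-lambda'
SELECT
  val,
  to_utc_timestamp(time__stamp) AS event_time,  -- conversion logic lives in the Lambda
  name,
  type
FROM airflow_cluster_data
WHERE name NOT IN ('abc', 'cdf', 'fgh')
  AND name NOT LIKE '%operator%'
  AND type != 'counter';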
CodePudding user response:
Probably the simplest method is to create a View:
CREATE VIEW foo AS
SELECT
val,
cast(from_unixtime(time__stamp / 1000) as timestamp) as timestamp,
cast(from_unixtime(time__stamp / 1000) as date) as date,
hour(cast(from_unixtime(time__stamp / 1000) as timestamp)) as hour,
name,
type
FROM airflow_cluster_data
WHERE name not in ('abc','cdf','fgh')
AND name not like '%operator%'
AND type != 'counter'
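Note that with a view the date and hour columns are computed at query time, so they behave like ordinary columns rather than physical partitions, but you can still filter on them. For example (2021-11-11 / hour 23 is what the sample epoch 1636674330000 resolves to in UTC):
SELECT val, name, type
FROM foo
WHERE "date" = DATE '2021-11-11'
  AND "hour" = 23;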