I have a file that looks like this:
33.49.147.163 20140416123526 https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en 29 409 Firefox/5.0
I want to load it into a hive table. I do it this way:
create external table Logs (
ip string,
ts timestamp,
request string,
page_size smallint,
status_code smallint,
info string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
"timestamp.formats" = "yyyyMMddHHmmss",
"input.regex" = '^(\\S*)\\t{3}(\\d{14})\\t(\\S*)\\t(\\S*)\\t(\\S*)\\t(\\S*).*$'
)
stored as textfile
location '/data/user_logs/user_logs_M';
And
select * from Logs limit 10;
results in
33.49.147.16 NULL https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en 29 409 Firefox/5.0
How to parse timestamps correctly, to avoid this NULLs?
CodePudding user response:
"timestamp.formats"
SerDe property works only with LazySimpleSerDe (STORED AS TEXTFILE
), it does not work with RegexSerDe
. If you are using RegexSerDe, then parse timestamp in a query.
Define ts
column as STRING
data type in CREATE TABLE and in the query transform it like this:
select timestamp(regexp_replace(ts,'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})','$1-$2-$3 $4:$5:$6.0')) as ts
Of course, you can extract each part of the timestamp using SerDe as separate columns and properly concatenate them with delimiters in the query to get correct timestamp format, but it will not give you any improvement because anyway you will need additional transformation in the query.