There is a Hive table with ~500,000 rows. It has a single column that holds a JSON string. The JSON stores measurements from 15 devices, organized like this:
company_id=…
device_1:
    array of measurements; each measurement has 2 attributes:
        value=
        date=
device_2:
…
device_3:
…
device_15:
…
There are 15 devices in the JSON, and every device has a nested array of measurements. The size of the measurements array is not fixed.
The goal is to keep, for each device, only the measurement with max(date).
The output of SELECT should have the following columns:
company_id
device_1_value
device_1_date
...
device_15_value
device_15_date
I tried to use LATERAL VIEW to explode the measurements arrays:
SELECT get_json_object(json_string, '$.company_id'),
       d1.value, d1.date, ... d15.value, d15.date
FROM T
LATERAL VIEW explode(device_1.measurements) t1 AS d1
LATERAL VIEW explode(device_2.measurements) t2 AS d2
…
LATERAL VIEW explode(device_15.measurements) t15 AS d15
I can feed the result of this SQL into another SQL that extracts the records with max(date) per device.
My approach does not scale well: the lateral views produce the cartesian product of all 15 arrays, so with just 2 measurements per device a single input row generates 2^15 = 32,768 output rows.
And there are 500,000 rows in the input table.
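For reference, the follow-up query mentioned above could be sketched roughly as follows. All names here are illustrative, not from the original post: it assumes the exploded result is available as a table or view `exploded` with columns `(company_id, d1_value, d1_date, ..., d15_value, d15_date)`. Pairing each value with its date in a struct lets max() pick the value that belongs to the latest date, since structs compare field by field; note that max() over struct types may not be supported in every Hive version.

```sql
-- Sketch only: `exploded` and its column names are hypothetical.
SELECT company_id,
       last1.dt  AS device_1_date,
       last1.val AS device_1_value
       -- ... same pattern for devices 2..15 ...
FROM (
  SELECT company_id,
         -- max() over a struct picks the (date, value) pair with the latest date
         max(named_struct('dt', d1_date, 'val', d1_value)) AS last1
         -- ... last2 .. last15 defined the same way ...
  FROM exploded
  GROUP BY company_id
) t;
```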
CodePudding user response:
You are actually in a great position to build a cheaper table/join. Bundling (your JSON string) is an optimization trick used to take horribly ugly joins/tables and optimize them.
The downside is that you should likely use a Hive user-defined function or a Spark job to pare down the data. SQL is amazing, but it likely isn't the right tool for this job. You likely want a programming language to ingest this data into a format that works well for SQL.
CodePudding user response:
To avoid the cartesian product generated by multiple lateral views, I split the original SQL into 15 independent SQLs (one per device), each with just one lateral view.
Then I join all 15 results.
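A rough sketch of one such per-device query, under the same assumptions as the pseudocode in the question (`device_1.measurements` stands in for the parsed JSON; table and alias names are illustrative). A `row_number()` window keeps only the latest measurement per company before the join:

```sql
-- Illustrative sketch, one lateral view per device.
WITH d1 AS (
  SELECT company_id,
         m.value  AS device_1_value,
         m.`date` AS device_1_date,
         -- rank measurements newest-first within each company
         row_number() OVER (PARTITION BY company_id
                            ORDER BY m.`date` DESC) AS rn
  FROM T
  LATERAL VIEW explode(device_1.measurements) t AS m
)
-- d2 .. d15 are defined the same way, each with a single lateral view
SELECT d1.company_id,
       d1.device_1_value, d1.device_1_date
       -- , d2.device_2_value, d2.device_2_date, ...
FROM d1
-- JOIN d2 ON d2.company_id = d1.company_id AND d2.rn = 1
-- ... and so on through d15 ...
WHERE d1.rn = 1;
```

Each subquery explodes only its own array, so a row with 2 measurements contributes 2 rows per subquery instead of 2^15 rows overall.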