I have a Hive table as below:
event_name | attendees_per_countries |
---|---|
a | {'US':5} |
b | {'US':4, 'UK': 3, 'CA': 2} |
c | {'UK':2, 'CA': 1} |
And I want to get a new table like below:
country | number_of_people |
---|---|
US | 9 |
UK | 5 |
CA | 4 |
How to write a query in Hive or Presto?
CodePudding user response:
You may use the following:
If the column type for attendees_per_countries
is a string, you may use the following:
WITH sample_data AS (
select
event_name,
str_to_map(
regexp_replace(attendees_per_countries,'[{|}]',''),
',',
':'
) as attendees_per_countries
FROM
raw_data
)
select
regexp_replace(cm.key,"[' ]","") as country,
SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC
However, if the column type for attendees_per_countries
is already a map
then you may use the following
select
regexp_replace(cm.key,"[' ]","") as country,
SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC
Full reproducible example below
with raw_data AS (
select 'a' as event_name, "{'US':5}" as attendees_per_countries
UNION ALL
select 'b', "{'US':4, 'UK': 3, 'CA': 2}"
UNION ALL
select 'c', "{'UK':2, 'CA': 1}"
),
sample_data AS (
select
event_name,
str_to_map(
regexp_replace(attendees_per_countries,'[{}]',''),
',',
':'
) as attendees_per_countries
FROM
raw_data
)
select
regexp_replace(cm.key,"[' ]","") as country,
SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC
Let me know if this works for you
CodePudding user response:
In presto if you have attendees_per_countries
as map you can use map_values
and then sum them with array_sum
/reduce
(I need to use later cause Athena does not support former one). If not - you can treat you data as json and cast it to MAP(VARCHAR, INTEGER)
and then use the mentioned functions:
WITH dataset(event_name, attendees_per_countries) AS (
VALUES
('a', JSON '{"US":5}'),
('b', JSON '{"US":4, "UK": 3, "CA": 2}'),
('c', JSON '{"UK":2, "CA": 1}')
)
SELECT event_name as country,
reduce(
map_values(cast(attendees_per_countries as MAP(VARCHAR, INTEGER))),
0,
(agg, curr) -> agg curr,
s -> s
) as number_of_people
FROM dataset
order by 2 desc
Output:
country | number_of_people |
---|---|
b | 9 |
a | 5 |
c | 3 |