How to convert the following dictionary format column into different format in Hive or Presto?-CodePudding

I have a Hive table as below:

event_name	attendees_per_countries
a	{'US':5}
b	{'US':4, 'UK': 3, 'CA': 2}
c	{'UK':2, 'CA': 1}

And I want to get a new table like below:

country	number_of_people
US	9
UK	5
CA	4

How to write a query in Hive or Presto?

CodePudding user response：

You may use the following:

If the column type for attendees_per_countries is a string, you may use the following:

WITH sample_data AS (
    select 
        event_name, 
        str_to_map(
            regexp_replace(attendees_per_countries,'[{|}]',''),
            ',',
            ':'
        ) as attendees_per_countries 
    FROM
        raw_data
        
)
select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

However, if the column type for attendees_per_countries is already a map then you may use the following

select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

Full reproducible example below

with raw_data AS (
    select 'a' as event_name, "{'US':5}" as attendees_per_countries
    UNION ALL 
    select 'b', "{'US':4, 'UK': 3, 'CA': 2}"
    UNION ALL 
    select 'c', "{'UK':2, 'CA': 1}"
),
sample_data AS (
    select 
        event_name, 
        str_to_map(
            regexp_replace(attendees_per_countries,'[{}]',''),
            ',',
            ':'
        ) as attendees_per_countries 
    FROM
        raw_data
        
)
select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

Let me know if this works for you

CodePudding user response：

In presto if you have attendees_per_countries as map you can use map_values and then sum them with array_sum/reduce (I need to use later cause Athena does not support former one). If not - you can treat you data as json and cast it to MAP(VARCHAR, INTEGER) and then use the mentioned functions:

WITH dataset(event_name, attendees_per_countries) AS (
   VALUES 
('a',   JSON '{"US":5}'),
('b',   JSON '{"US":4, "UK": 3, "CA": 2}'),
('c',   JSON '{"UK":2, "CA": 1}')
 ) 
 
SELECT event_name as country,
       reduce(
               map_values(cast(attendees_per_countries as MAP(VARCHAR, INTEGER))),
               0,
               (agg, curr) -> agg   curr,
               s -> s
           )      as number_of_people
FROM dataset
order by 2 desc

Output:

country	number_of_people
b	9
a	5
c	3