Home > Enterprise >  How to convert the following dictionary format column into different format in Hive or Presto?
How to convert the following dictionary format column into different format in Hive or Presto?

Time:09-24

I have a Hive table as below:

event_name attendees_per_countries
a {'US':5}
b {'US':4, 'UK': 3, 'CA': 2}
c {'UK':2, 'CA': 1}

And I want to get a new table like below:

country number_of_people
US 9
UK 5
CA 4

How to write a query in Hive or Presto?

CodePudding user response:

You may use the following:

If the column type for attendees_per_countries is a string, you may use the following:

WITH sample_data AS (
    select 
        event_name, 
        str_to_map(
            regexp_replace(attendees_per_countries,'[{|}]',''),
            ',',
            ':'
        ) as attendees_per_countries 
    FROM
        raw_data
        
)
select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

However, if the column type for attendees_per_countries is already a map then you may use the following

select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

Full reproducible example below

with raw_data AS (
    select 'a' as event_name, "{'US':5}" as attendees_per_countries
    UNION ALL 
    select 'b', "{'US':4, 'UK': 3, 'CA': 2}"
    UNION ALL 
    select 'c', "{'UK':2, 'CA': 1}"
),
sample_data AS (
    select 
        event_name, 
        str_to_map(
            regexp_replace(attendees_per_countries,'[{}]',''),
            ',',
            ':'
        ) as attendees_per_countries 
    FROM
        raw_data
        
)
select 
    regexp_replace(cm.key,"[' ]","") as country,
    SUM(cm.value) as no_of_people
from sample_data
lateral view explode(attendees_per_countries) cm
GROUP BY regexp_replace(cm.key,"[' ]","")
ORDER BY no_of_people DESC

Let me know if this works for you

CodePudding user response:

In presto if you have attendees_per_countries as map you can use map_values and then sum them with array_sum/reduce (I need to use later cause Athena does not support former one). If not - you can treat you data as json and cast it to MAP(VARCHAR, INTEGER) and then use the mentioned functions:

WITH dataset(event_name, attendees_per_countries) AS (
   VALUES 
('a',   JSON '{"US":5}'),
('b',   JSON '{"US":4, "UK": 3, "CA": 2}'),
('c',   JSON '{"UK":2, "CA": 1}')
 ) 
 
SELECT event_name as country,
       reduce(
               map_values(cast(attendees_per_countries as MAP(VARCHAR, INTEGER))),
               0,
               (agg, curr) -> agg   curr,
               s -> s
           )      as number_of_people
FROM dataset
order by 2 desc

Output:

country number_of_people
b 9
a 5
c 3
  • Related