count most repeated value per group in hive?-CodePudding

I am using hive 0.14.0 in a hortonworks data platform, on a big file similar to this input data:

tpep_pickup_datetime	pulocationid
2022-01-28 23:32:52.0	100
2022-02-28 23:02:40.0	202
2022-02-28 17:22:45.0	102
2022-02-28 23:19:37.0	102
2022-03-29 17:32:02.0	102
2022-01-28 23:32:40.0	101
2022-02-28 17:28:09.0	201
2022-03-28 23:59:54.0	100
2022-02-28 21:02:40.0	100

I want to find out what was the most common hour in each locationid, this being the result:

locationid	hour
100	23
101	17
102	17
201	17
202	23

i was thinking in using a partition command like this:

select * from (
   select hour(tpep_pickup_datetime), pulocationid
      
        (max (hour(tpep_pickup_datetime))) over (partition by pulocationid) as max_hour,
        row_number() over (partition by pulocationid) as row_no
        from yellowtaxi22
   ) res 
   where res.row_no = 1;

but it shows me this error: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Invalid function pulocationid

is there any other way of doing this?

CodePudding user response：

with raw_groups -- subquery syntax
(
 select 
  struct( 
   count(time_stamp), -- must be first for max to use it to sort on
   location.pulocationid , 
   hour(time_stamp) as hour 
  ) as mylocation -- create a struct to make max do the work for us
 from 
  location 
 group by 
  location.pulocationid,  
  hour(time_stamp)
),
grouped_data as  -- another subquery syntax based on `with`
( 
 select 
  max(mylocation) as location -- will pick max based on count(time_stamp)
 from 
  raw_groups 
 group by 
  mylocation.pulocationid
) 
select --format data into your requested format
 location.pulocationid, 
 location.hour 
from 
 grouped_data

I do not remember hive 0.14 can use with clause, but you could easily re-write the query to not use it.(by substituting the select in pace of the table names) I just don't find it as readable:

select --format data into your requested format
 location.pulocationid, 
 location.hour 
from 
 ( 
 select 
  max(mylocation) as location -- will pick max based on count(time_stamp)
 from 
  (
 select 
  struct( 
   count(time_stamp), -- must be first for max to use it to sort on
   location.pulocationid , 
   hour(time_stamp) as hour 
  ) as mylocation -- create a struct to make max do the work for us
 from 
  location 
 group by 
  location.pulocationid,  
  hour(time_stamp)
)
 group by 
  mylocation.pulocationid
)

CodePudding user response：

You were half way there!

The idea was in the right direction however the syntax is a little bit off:

First find the count per each hour

Select pulocationid, hour (tpep_pickup_datetime), count (*) cnt from yellowtaxi22 
     Group by pulocationid, hour (tpep_pickup_datetime)

Then add the row_number but you need to order it by the total count in a descending way:

select pulocationid , hour , cnt , row_number () over ( partition be pulocationid order by cnt desc ) as row_no from
Last but not the list, take only the rows with the highest count ( this can be done by the max function rather than the row_number one by the way)

Or in total :

      select pulocationid ,  hour from (

        select  pulocationid ,  hour ,  cnt  , row_number ()
 over ( partition by pulocationid order by  cnt desc ) 
as row_no  from (
        Select pulocationid, hour (tpep_pickup_datetime), count (*) cnt from yellowtaxi22 
        Group by pulocationid, hour (tpep_pickup_datetime) ))
        Where row_no=1