Need help trying to get the desired output from this input, get total_visits, most_visited

Input:

name	address	email	floor	resources
A	Bangalore	[email protected]	1	CPU
A	Bangalore	[email protected]	1	CPU
A	Bangalore	[email protected]	2	DESKTOP
B	Bangalore	[email protected]	2	DESKTOP
B	Bangalore	[email protected]	2	DESKTOP
B	Bangalore	[email protected]	1	MONITIOR

Desired output:

name	total visits	most visited floor	resources used
A	3	1	CPU, ,DESKTOP
B	3	2	DESKTOP,MONITIOR

so I came up with this code and approach using spark-sql but I am also ok if anyone is able to answer it in ms-sql or sql-server anything is fine

select name, concat_ws(',', collect_set(resources)) as resources_used, count(*) as total_visits 
from resources_table 
group by name

I am not able to compute the most_visited_floor column to get the desired output.

Appreciate the help

CodePudding user response：

Try this:

val df = Seq( 
              ( "A", "Bangalore", "a*.com", 1, "cpu" ),
              ( "A", "Bangalore", "a*.com", 1, "cpu" ),
              ( "A", "Bangalore", "a*.com", 2, "desktop" ),
              ( "B", "Bangalore", "a*.com", 2, "desktop" ),
              ( "B", "Bangalore", "a*.com", 2, "desktop" ),
              ( "B", "Bangalore", "a*.com", 1, "monitor" ),
             ).toDF("name" ,"address", "email", "floor", "resource")

df.createOrReplaceTempView("R")

val res = spark.sql(""" 

                      select A.name, A.total_visits, B.floor, C.resources from (  
                        select R.name, count(*) as total_visits 
                          from R
                      group by R.name  ) A,
               
                        (
                        select Z.name, Z.floor, Z.most_visited
                          from (
                        select X.*, rank() over (partition by X.name order by X.most_visited desc) as RANK
                          from (
                                select R.name, R.floor, count(R.floor) as most_visited 
                                  from R
                              group by R.name, R.floor) X ) Z     
                        where Z.RANK = 1 ) B, 

                        (
                        select R.name, array_sort(collect_set(resource)) as resources 
                          from R
                      group by R.name ) C
                    where A.name = B.name and B.name = C.name
                         
                    """)
res.show(false)

It returns:

 ---- ------------ ----- ------------------ 
|name|total_visits|floor|resources         |
 ---- ------------ ----- ------------------ 
|A   |3           |1    |[cpu, desktop]    |
|B   |3           |2    |[desktop, monitor]|
 ---- ------------ ----- ------------------

CodePudding user response：

What you are looking for is called in statistics Mode.
Search for Mode SQL and you'll find endless blogs and posts.

There are multiple ways to get the Mode.

Here is one option, assuming there is a single Mode value:

with 
t (name,address,email,floor,resources) as
(
    select  *
    from    values   ('A' ,'Bangalore' ,'[email protected]'  ,1  ,'CPU'     )
                    ,('A' ,'Bangalore' ,'[email protected]'  ,1  ,'CPU'     )
                    ,('A' ,'Bangalore' ,'[email protected]'  ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'[email protected]' ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'[email protected]' ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'[email protected]' ,1  ,'MONITIOR')
),
t1 as
(
  select  * ,count(*) over (partition by name, floor) as count_name_floor
  from    t
)
select   name
        ,count(*)                              as total_visitsA
        ,max((count_name_floor,floor)).floor   as most_visited_floor
        ,concat_ws(',',collect_set(resources)) as resources_used
from     t1
group by name

name	total_visits	most_visited_floor	resources_used
B	3	2	MONITIOR,DESKTOP
A	3	1	DESKTOP,CPU

Here is another option assuming there might be multiple Mode values.
I added 2 rows to the input, to make it more interesting.

with 
t (name,address,email,floor,resources) as
(
    select  *
    from    values   ('A' ,'Bangalore' ,'[email protected]'  ,1  ,'CPU'     )
                    ,('A' ,'Bangalore' ,'[email protected]'  ,1  ,'CPU'     )
                    ,('A' ,'Bangalore' ,'[email protected]'  ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'[email protected]' ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'[email protected]' ,2  ,'DESKTOP' )
                    ,('B' ,'Bangalore' ,'[email protected]' ,1  ,'MONITIOR')
                    ,('B' ,'Bangalore' ,'[email protected]' ,1  ,'MONITIOR')
                    ,('B' ,'Bangalore' ,'[email protected]' ,3  ,'MONITIOR')
),
t1 as
(
  select  * ,count(*) over (partition by name, floor) as count_name_floor
  from    t
),
t2 as
(
  select  * ,rank() over (partition by name order by count_name_floor desc) as rank_count_name_floor
  from    t1
)
select   name
        ,count(*)                                                                      as total_visitsA
        ,concat_ws(',',collect_set(case rank_count_name_floor when 1 then floor end))  as most_visited_floors
        ,concat_ws(',',collect_set(resources))                                         as resources_used
from     t2
group by name

name	total_visitsA	most_visited_floors	resources_used
A	3	1	DESKTOP,CPU
B	5	1,2	MONITIOR,DESKTOP