Input:
name | address | floor | resources | |
---|---|---|---|---|
A | Bangalore | [email protected] | 1 | CPU |
A | Bangalore | [email protected] | 1 | CPU |
A | Bangalore | [email protected] | 2 | DESKTOP |
B | Bangalore | [email protected] | 2 | DESKTOP |
B | Bangalore | [email protected] | 2 | DESKTOP |
B | Bangalore | [email protected] | 1 | MONITIOR |
Desired output:
name | total visits | most visited floor | resources used |
---|---|---|---|
A | 3 | 1 | CPU, ,DESKTOP |
B | 3 | 2 | DESKTOP,MONITIOR |
so I came up with this code and approach using spark-sql but I am also ok if anyone is able to answer it in ms-sql or sql-server anything is fine
select name, concat_ws(',', collect_set(resources)) as resources_used, count(*) as total_visits
from resources_table
group by name
I am not able to compute the most_visited_floor column to get the desired output.
Appreciate the help
CodePudding user response:
Try this:
val df = Seq(
( "A", "Bangalore", "a*.com", 1, "cpu" ),
( "A", "Bangalore", "a*.com", 1, "cpu" ),
( "A", "Bangalore", "a*.com", 2, "desktop" ),
( "B", "Bangalore", "a*.com", 2, "desktop" ),
( "B", "Bangalore", "a*.com", 2, "desktop" ),
( "B", "Bangalore", "a*.com", 1, "monitor" ),
).toDF("name" ,"address", "email", "floor", "resource")
df.createOrReplaceTempView("R")
val res = spark.sql("""
select A.name, A.total_visits, B.floor, C.resources from (
select R.name, count(*) as total_visits
from R
group by R.name ) A,
(
select Z.name, Z.floor, Z.most_visited
from (
select X.*, rank() over (partition by X.name order by X.most_visited desc) as RANK
from (
select R.name, R.floor, count(R.floor) as most_visited
from R
group by R.name, R.floor) X ) Z
where Z.RANK = 1 ) B,
(
select R.name, array_sort(collect_set(resource)) as resources
from R
group by R.name ) C
where A.name = B.name and B.name = C.name
""")
res.show(false)
It returns:
---- ------------ ----- ------------------
|name|total_visits|floor|resources |
---- ------------ ----- ------------------
|A |3 |1 |[cpu, desktop] |
|B |3 |2 |[desktop, monitor]|
---- ------------ ----- ------------------
CodePudding user response:
What you are looking for is called in statistics Mode.
Search for Mode SQL and you'll find endless blogs and posts.
There are multiple ways to get the Mode.
Here is one option, assuming there is a single Mode value:
with
t (name,address,email,floor,resources) as
(
select *
from values ('A' ,'Bangalore' ,'[email protected]' ,1 ,'CPU' )
,('A' ,'Bangalore' ,'[email protected]' ,1 ,'CPU' )
,('A' ,'Bangalore' ,'[email protected]' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'[email protected]' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'[email protected]' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'[email protected]' ,1 ,'MONITIOR')
),
t1 as
(
select * ,count(*) over (partition by name, floor) as count_name_floor
from t
)
select name
,count(*) as total_visitsA
,max((count_name_floor,floor)).floor as most_visited_floor
,concat_ws(',',collect_set(resources)) as resources_used
from t1
group by name
name | total_visits | most_visited_floor | resources_used |
---|---|---|---|
B | 3 | 2 | MONITIOR,DESKTOP |
A | 3 | 1 | DESKTOP,CPU |
Here is another option assuming there might be multiple Mode values.
I added 2 rows to the input, to make it more interesting.
with
t (name,address,email,floor,resources) as
(
select *
from values ('A' ,'Bangalore' ,'[email protected]' ,1 ,'CPU' )
,('A' ,'Bangalore' ,'[email protected]' ,1 ,'CPU' )
,('A' ,'Bangalore' ,'[email protected]' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'[email protected]' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'[email protected]' ,2 ,'DESKTOP' )
,('B' ,'Bangalore' ,'[email protected]' ,1 ,'MONITIOR')
,('B' ,'Bangalore' ,'[email protected]' ,1 ,'MONITIOR')
,('B' ,'Bangalore' ,'[email protected]' ,3 ,'MONITIOR')
),
t1 as
(
select * ,count(*) over (partition by name, floor) as count_name_floor
from t
),
t2 as
(
select * ,rank() over (partition by name order by count_name_floor desc) as rank_count_name_floor
from t1
)
select name
,count(*) as total_visitsA
,concat_ws(',',collect_set(case rank_count_name_floor when 1 then floor end)) as most_visited_floors
,concat_ws(',',collect_set(resources)) as resources_used
from t2
group by name
name | total_visitsA | most_visited_floors | resources_used |
---|---|---|---|
A | 3 | 1 | DESKTOP,CPU |
B | 5 | 1,2 | MONITIOR,DESKTOP |