Home > Software design >  Not able to aggregate derived column of week of month in Spark SQL
Not able to aggregate derived column of week of month in Spark SQL

Time:09-13

Below is the Spark SQL which gives the result as shown in the output

select date_part('WEEK', ts), date_format(ts,'yyyy-MM-dd') from threads where ts between
'2020-08-05 00:00:00' and '2020-09-06 00:00:00' order by date_format(ts,'yyyy-MM-dd')

Output

date_part(WEEK, ts) date_format(ts, yyyy-MM-dd)
32                   2020-08-07
32                   2020-08-07
32                   2020-08-07
33                   2020-08-12
33                   2020-08-12
33                   2020-08-12
33                   2020-08-12
36                   2020-08-31
36                   2020-08-31
36                   2020-08-31
36                   2020-08-31
36                   2020-08-31
36                   2020-09-01
36                   2020-09-02

But when I try to find the week count by grouping data

%sql

select count(date_part('WEEK', ts)), date_part('WEEK', ts), date_format(ts,'yyyy-MM-dd') from 
threads where ts between '2020-08-05 00:00:00' and '2020-09-06 00:00:00' group by 
date_part('WEEK', ts) order by date_format(ts,'yyyy-MM-dd')

I am getting the below Error,

Error in SQL statement: AnalysisException: expression 'threads.ts' is neither present in the 
group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) 
if you don't care which value you get.;
Sort [date_format(ts, yyyy-MM-dd)#2569 ASC NULLS FIRST], true
 - Aggregate [date_part(WEEK, ts#1130)], [count(date_part(WEEK, ts#1130)) AS 
count(date_part(WEEK, ts))#2567L, date_part(WEEK, ts#1130) AS date_part(WEEK, ts)#2568, 
date_format(cast(ts#1130 as timestamp), yyyy-MM-dd, Some(Etc/UTC)) AS date_format(ts, yyyy-MM- 
dd)#2569]

Any help with the above issue is highly appreciated

Expected result is

count(date_part('WEEK', ts))   date_part(WEEK, ts) date_format(ts, yyyy-MM-dd)
3                               32                   2020-08-07
4                               33                   2020-08-12
7                               36                   2020-08-31

CodePudding user response:

You need to group by start of weeks:

%sql

select 
  count(ts) cnt, 
  date_part('WEEK', date_trunc('WEEK', ts)) as week, 
  date_format(min(ts),'yyyy-MM-dd') as week_date
from threads 
where ts between '2020-08-05 00:00:00' and '2020-09-06 00:00:00'
group by date_trunc('WEEK', ts) 
order by date_trunc('WEEK', ts)

 --- ---- ---------- 
|cnt|week| week_date|
 --- ---- ---------- 
|  3|  32|2020-08-07|
|  4|  33|2020-08-12|
|  7|  36|2020-08-31|
 --- ---- ---------- 
  • Related