Error in Spark sql : AnalysisException: Correlated scalar subquery-CodePudding

I am creating a spark dataframe using the below query:

select distinct  adr.ProductionDate,adr.CostCenterKey,adr.AEMainCategoryKey, Sum(adr.DurationDayFrac*adr.FixedCashCostAE)*100 / (select sum(adr_sub.DurationDayFrac*adr_sub.FixedCashCostAE) 
from hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting  adr_sub
where adr_sub.costcenterkey=adr.costcenterkey and adr_sub.ProductionDate=adr.ProductionDate
) as AELossMagnitude 
FROM hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting  adr 
WHERE adr.DurationDayFrac != 0 AND adr.ProductionDate <= Current_Date() 
AND adr.ProductionDate >= '2017-01-01'
AND adr.SiteName in ('Ludwigshafen','Schwarzheide','Antwerpen') 
AND AreaIsCurrent = 'true' 
group by adr.ProductionDate,adr.CostCenterKey,adr.AEMainCategoryKey
order by adr.ProductionDate,adr.CostCenterKey,adr.AEMainCategoryKey

But i am getting the below error:

AnalysisException: Correlated scalar subquery 'scalarsubquery(adr.costcenterkey, adr.ProductionDate)' is neither present in the group by, nor in an aggregate function. Add it to group by using ordinal position or wrap it in first() (or first_value) if you don't care which value you get.;

Request you to please help me with the correct syntax.

I am expecting the correct syntax to remove the error.z Its asking for group by in the sub query which is not possible.

CodePudding user response：

First, I don't understand why you need the distinct key word. I think the correlated sub-query will not work. Use a common table expression instead. Please see code below.

--
--  Use common table expression instead of correlated sub query
--

WITH cte_cost_by_center_n_date
(
SELECT
   costcenterkey,
   ProductionDate,
   sum(DurationDayFrac * FixedCashCostAE) as TotalCost

FROM
    hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting 
GROUP BY
   costcenterkey,
   ProductionDate
)

SELECT
    adr.ProductionDate,
    adr.CostCenterKey,
    adr.AEMainCategoryKey, 
    sum(adr.DurationDayFrac*adr.FixedCashCostAE) * 100 /  adr_sub.TotalCost as AELossMagnitude 
FROM 
    hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting as adr 
JOIN
    cte_cost_by_center_n_date as adr_sub
ON 
    adr_sub.costcenterkey = adr.costcenterkey 
    AND adr_sub.ProductionDate = adr.ProductionDate
WHERE 
    adr.DurationDayFrac != 0 
    AND adr.ProductionDate <= Current_Date() 
    AND adr.ProductionDate >= '2017-01-01'
    AND adr.SiteName in ('Ludwigshafen','Schwarzheide','Antwerpen') 
    AND adr.AreaIsCurrent = 'true' 
GROUP BY
    adr.ProductionDate,
    adr.CostCenterKey,
    adr.AEMainCategoryKey
ORDER BY 
    adr.ProductionDate,
    adr.CostCenterKey,
    adr.AEMainCategoryKey

CodePudding user response：

Here is a data set that I have locally, Adventure works. Let's figure out sales at the company level regardless of region.

select 
    d.CalendarYear,
    d.CalendarQuarter,
    sum(s.SalesAmount) as CompanyAmt
from 
    star.fact_internet_sales as s
join 
    star.dim_date as d
on 
    s.OrderDateKey = d.DateKey
where 
    d.CalendarYear = 2013
group by
    d.CalendarYear,
    d.CalendarQuarter
order by
    d.CalendarYear,
    d.CalendarQuarter

Next, lets figure out sales are the regional level (territory).

select 
    d.CalendarYear,
    d.CalendarQuarter,
    s.SalesTerritoryKey,
    sum(s.SalesAmount) as RegionAmt
from 
    star.fact_internet_sales as s
join 
    star.dim_date as d
on 
    s.OrderDateKey = d.DateKey
where 
    d.CalendarYear = 2013
group by
    d.CalendarYear,
    d.CalendarQuarter,
    s.SalesTerritoryKey
order by
    d.CalendarYear,
    d.CalendarQuarter,
    s.SalesTerritoryKey

Last, use the CTE to bring in a total so that we can figure out what percentage does each territory contribute to sales for a given quarter.

-- Sales per qtr at company level
with cte_whole_company as
(
select 
    d.CalendarYear,
    d.CalendarQuarter,
    sum(s.SalesAmount) as CompanyAmt
from 
    star.fact_internet_sales as s
join 
    star.dim_date as d
on 
    s.OrderDateKey = d.DateKey
where 
    d.CalendarYear = 2013
group by
    d.CalendarYear,
    d.CalendarQuarter
order by
    d.CalendarYear,
    d.CalendarQuarter
)
-- Aggregated sales at territory with pct/contribution
select 
    d.CalendarYear,
    d.CalendarQuarter,
    s.SalesTerritoryKey,
    sum(s.SalesAmount) as RegionAmt,
    first(c.CompanyAmt) as CompanyAmt,
    sum(s.SalesAmount) / first(c.CompanyAmt) * 100 as PctAmt
from 
    star.fact_internet_sales as s
join 
    star.dim_date as d
on 
    s.OrderDateKey = d.DateKey
join
    cte_whole_company as c
on 
    d.CalendarYear = c.CalendarYear and
    d.CalendarQuarter = c.CalendarQuarter
where 
    d.CalendarYear = 2013
group by
    d.CalendarYear,
    d.CalendarQuarter,
    s.SalesTerritoryKey
order by
    d.CalendarYear,
    d.CalendarQuarter,
    s.SalesTerritoryKey

My original statement that sub-queries can not be used in aggregations is true. Yes, you might be able to create a derived query or even create a temporary view, but I like common table expressions.