| col1      | col2      |
|-----------|-----------|
| [1,2,3,4] | [0,1,0,3] |
| [5,6,7,8] | [0,3,4,8] |
Desired result:
| col1        | col2       |
|-------------|------------|
| [6,8,10,12] | [0,4,4,11] |
In Snowflake's Snowpark this is relatively straightforward using array_construct. Apache Spark has a similar array function, but there is a major difference.
In Snowpark, I can do array_construct(count('*'), sum(col('x')), sum(col('y')), count(col('y')))
but Apache Spark seems to treat array() as an aggregation and complains that I can't have an aggregation inside of an aggregation:
pyspark.sql.utils.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;
I'm trying to write a piece of code that can handle both Snowpark and Apache Spark, but this array_construct vs array difference is proving trickier than anticipated. Next up is to explore doing a groupBy & collect_list, but I'm wondering how others have solved this?
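For context, here is a minimal, self-contained sketch of the plain-PySpark form I'm aiming for; the group column g and value columns x and y are placeholders, not my real schema:

# Minimal sketch (placeholder columns g, x, y): wrap the aggregates in F.array
# inside agg, as the direct analogue of Snowpark's array_construct.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1, 0), ("a", 2, 3)], ["g", "x", "y"])

df.groupBy("g").agg(
    F.array(F.count("*"), F.sum("x"), F.sum("y"), F.count("y")).alias("stats")
).show()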
CodePudding user response:
If you treat the aggregations as delimited strings, you can split them back into an array:
select a, split(concat_ws(',',count(b),sum(c)),',')
from t
group by a;
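A rough PySpark sketch of the same idea (the DataFrame and column names a, b, c here are made up for illustration): concatenate the aggregates into a comma-delimited string, then split it back into an array. Note the resulting array elements are strings, as in the SQL above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("k1", 1, 2.0), ("k1", 3, 4.0)], ["a", "b", "c"])

# Concatenate the aggregates with a delimiter, then split back into an array.
df.groupBy("a").agg(
    F.split(F.concat_ws(",", F.count("b"), F.sum("c")), ",").alias("aggs")
).show(truncate=False)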
CodePudding user response:
It turns out it's not the difference between the array functions at all; it was a completely different issue.
I have a translation function that translates Snowpark's regexp_count function to PySpark:
from pyspark.sql import functions as F
def regexp_count(self, col, regex):
    # Already an aggregate expression: F.sum over a per-row 0/1 match flag.
    return F.sum(F.when(col.rlike(regex), 1).otherwise(0))
This caused issues later on when I did F.sum(regexp_count(col, regex)): as becomes obvious, that puts an F.sum inside an F.sum, hence the error. :/
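To make the failure mode concrete, here is a small standalone sketch (made-up data and column names, and the function stripped of self) showing the nesting that triggers the AnalysisException, and the form that avoids it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("g1", "foo"), ("g1", "bar")], ["grp", "txt"])

def regexp_count(col, regex):
    return F.sum(F.when(col.rlike(regex), 1).otherwise(0))

# Fails: regexp_count already contains F.sum, so wrapping it in another F.sum
# nests one aggregate inside another.
# df.groupBy("grp").agg(F.sum(regexp_count(F.col("txt"), "o+"))).show()

# Works: the translated expression is already an aggregate, so use it directly.
df.groupBy("grp").agg(regexp_count(F.col("txt"), "o+").alias("matches")).show()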