How do I aggregate array elements column-wise in pyspark?


input:

col1 col2
[1,2,3,4] [0,1,0,3]
[5,6,7,8] [0,3,4,8]

desired result:

col1 col2
[6,8,10,12] [0,4,4,11]

In Snowflake's Snowpark this is relatively straightforward using array_construct. Apache Spark has a similar array function, but there is a major difference.

In Snowpark, I can do array_construct(count('*'), sum(col('x')), sum(col('y')), count(col('y'))), but Apache Spark seems to treat array() as an aggregation and complains that I can't have an aggregation inside of an aggregation:

pyspark.sql.utils.AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;

I'm trying to write a piece of code that can handle both Snowpark and Apache Spark, but this array_construct vs. array difference is proving trickier than anticipated. Next up is to explore doing a groupBy & collect_list (roughly sketched below), but I'm wondering how others have solved this?
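
Roughly what I have in mind for the groupBy & collect_list route, for a single column (untested sketch; column names taken from the example above, and F.transform with a Python lambda needs Spark 3.1+):

    from pyspark.sql import functions as F

    # explode each array with its position, sum per position,
    # then rebuild the array in position order
    per_pos = (
        df.select(F.posexplode("col1").alias("pos", "val"))
        .groupBy("pos")
        .agg(F.sum("val").alias("total"))
    )

    col1_summed = (
        per_pos
        .agg(F.array_sort(F.collect_list(F.struct("pos", "total"))).alias("pairs"))
        .select(F.transform("pairs", lambda p: p["total"]).alias("col1"))
    )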

CodePudding user response:

If you treat the aggregations as delimited strings, you can split them into an array:

    select a, split(concat_ws(',', count(b), sum(c)), ',')
    from t
    group by a;

CodePudding user response:

Turns out it's not the difference between the array functions at all; it was a completely different issue.

I have a translation function that translates Snowpark's regexp_count function to PySpark using the formula

    # F is pyspark.sql.functions; note this builds an *aggregate*
    # expression (a conditional row count), not a per-row match count
    def regexp_count(self, col, regex):
        return F.sum(F.when(col.rlike(regex), 1).otherwise(0))

This caused issues later on when I did F.sum(regexp_count(col, regex)), hence the error: as becomes obvious once you expand the translation, there is an F.sum inside an F.sum :/
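
If the translation layer needs to stay composable with outer aggregates, one option (just a sketch, assuming Spark 3.5+, where pyspark.sql.functions.regexp_count is available) is to translate regexp_count to a per-row expression and leave any summing to the caller:

    from pyspark.sql import functions as F

    def regexp_count(self, col, regex):
        # Per-row count of regex matches within the string column, which is
        # closer to Snowflake's REGEXP_COUNT semantics. Because it is not an
        # aggregate, wrapping it in an outer F.sum() no longer nests
        # aggregate functions. `regex` is assumed to be a plain string.
        return F.regexp_count(col, F.lit(regex))

With a per-row translation like that, building the array with F.array(...) over the outer aggregates should then behave the same way array_construct does in Snowpark.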
