Why do I get a naming convention error in PySpark when the name is correct?

Time:06-27

I'm trying to group a DataFrame by a column called saleId and then sum a column called totalAmount, using the code below:

df = df.groupBy('saleId').agg({"totalAmount": "sum"})

But I get the following error:

Attribute sum(totalAmount) contains an invalid character among ,;{}()\n\t=. Please use an alias to rename it

I'm assuming there's something wrong with the way I'm using groupBy, because I get other errors even when I try the following code instead of the above one:

df = df.groupBy('saleId').sum('totalAmount')

What's the problem with my code?

CodePudding user response:

OK, I figured out what went wrong.

The code I used in my question names the resulting aggregate column sum(totalAmount), which, as the error message says, contains parentheses, characters PySpark considers invalid in a column name.

This can be avoided by using:

df = df.groupBy('saleId').agg({"totalAmount": "sum"}).withColumnRenamed('sum(totalAmount)', 'totalAmount')

or

from pyspark.sql import functions as F

df = df.groupBy('saleId').agg(F.sum("totalAmount").alias("totalAmount"))