How do I use rdd.map/groupByKey with multiple values?


I'm currently learning pyspark and I don't quite understand the syntax for using multiple values with groupByKey(). Assuming I wanted to find the category with the highest total quantity in a schema of (Category, Quantity), it seems straightforward enough. I could do:

itemQuantity = df.rdd.map(lambda x: (x[0], x[1]))
highestQuantity = (itemQuantity.groupByKey()
                   .map(lambda x: (x[0], sum(x[1])))
                   .top(1, key=lambda x: x[1]))

But assuming I had a schema of (Category, Price, Quantity) and had to find the average item cost in each category, how would I do it? Conceptually I understand that I would first have to groupByKey() to achieve something like (Category, [Price, Quantity]), because groupByKey() doesn't appear to work with three values (it raises ValueError: too many values to unpack (expected 2)). When I do:

averageCost = df.rdd.map(lambda x: (x[0], (x[1], x[2])))
aC = averageCost.groupByKey()

I end up with an RDD that I'm unsure how to manipulate, now that each value is an iterable of (Price, Quantity) tuples.

[('Category_1', <pyspark.resultiterable.ResultIterable object at 0x7fb812b5ef70>), ... ,('Category_n', <pyspark.resultiterable.ResultIterable object at 0x7fb812b5fb50>)]

Ideally I want to end up with something like (Category, sum(Price), sum(Quantity)). How would I achieve this? I have been using reduceByKey() on two separate schemas, (Category, Price) and (Category, Quantity), instead, although that's far from ideal. Also, are there any cheat sheets you would recommend for a beginner to learn the syntax? I've looked through a lot of examples and documentation, but I often find them too abstract to understand.
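
To make the goal concrete, here is roughly what I'm trying to do with aC above (a sketch using mapValues; I don't know whether this is the right approach):

# sum the (Price, Quantity) tuples inside each grouped ResultIterable,
# then flatten to (Category, sum(Price), sum(Quantity))
summed = aC.mapValues(lambda vals: tuple(map(sum, zip(*vals))))
flattened = summed.map(lambda x: (x[0], x[1][0], x[1][1]))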

CodePudding user response:

Is there any specific reason you want to use RDD functions instead of the DataFrame ones? In your case it would be easier to write:

from pyspark.sql import functions as F

# importing the functions module as F avoids shadowing Python's built-in sum()
df.groupBy("Category").agg(F.sum("Price"), F.sum("Quantity"))
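
And if the goal is the average item cost per category (taking that to mean total price divided by total quantity, to match the (Category, sum(Price), sum(Quantity)) shape you describe), the same aggregation gives it directly, for example:

from pyspark.sql import functions as F

# average cost per category = total price / total quantity
# (column names "Category", "Price", "Quantity" taken from the question)
df.groupBy("Category").agg(
    (F.sum("Price") / F.sum("Quantity")).alias("avg_cost")
)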

CodePudding user response:

It seems like

priceAndQuantity = df.rdd.map(lambda x: (x[0], (x[1], x[2])))  # (Category, (Price, Quantity))
PAQ1 = priceAndQuantity.reduceByKey(lambda v1, v2: (v1[0] + v2[0], v1[1] + v2[1]))  # sum per category
PAQ2 = PAQ1.map(lambda x: (x[0], x[1][0], x[1][1]))  # flatten to (Category, sum(Price), sum(Quantity))

does what I was hoping to do.
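
And from PAQ1, the average cost I was originally after falls out with one more step (assuming average cost = total price / total quantity):

# divide summed price by summed quantity per category
avgCost = PAQ1.mapValues(lambda v: v[0] / v[1])
# avgCost is an RDD of (Category, average_cost) pairs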
