How do I use rdd.map/groupByKey with multiple values?


I'm currently learning pyspark and I don't quite understand the syntax for using multiple values with groupByKey(). Assuming I wanted to find the category with the highest total quantity in a schema of (Category, Quantity), it seems straightforward enough. I could do:

itemQuantity = df.rdd.map(lambda x: (x[0], x[1]))
highestQuantity = (itemQuantity.groupByKey()
                   .map(lambda x: (x[0], sum(x[1])))
                   .top(1, key=lambda x: x[1]))

But assuming I had a schema of (Category, Price, Quantity) and had to find the average item cost in each category, how would I do it? Conceptually I understand that I would first have to groupByKey() to achieve something like (Category, [Price, Quantity]), because groupByKey() doesn't appear to work with three values (it raises ValueError: too many values to unpack (expected 2)). When I do:

averageCost = df.rdd.map(lambda x: (x[0], (x[1], x[2])))
aC = averageCost.groupByKey()

I end up with an RDD that I'm unsure how to manipulate, now that each value is an iterable of (Price, Quantity) tuples.

[('Category_1', <pyspark.resultiterable.ResultIterable object at 0x7fb812b5ef70>), ... ,('Category_n', <pyspark.resultiterable.ResultIterable object at 0x7fb812b5fb50>)]

Ideally I want to end up with something like (Category, sum(Price), sum(Quantity)). How would I achieve this? I have been using reduceByKey() on two separate schemas, (Category, Price) and (Category, Quantity), instead, although that's far from ideal. Also, are there any cheat sheets you would recommend for a beginner to learn the syntax? I've looked through a lot of examples and documentation, but I often find them too abstract to understand.
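
To make the goal concrete, here is roughly what I'm trying to do with aC above (a sketch using mapValues; I don't know whether this is the right approach):

# sum the (Price, Quantity) tuples inside each grouped ResultIterable,
# then flatten to (Category, sum(Price), sum(Quantity))
summed = aC.mapValues(lambda vals: tuple(map(sum, zip(*vals))))
flattened = summed.map(lambda x: (x[0], x[1][0], x[1][1]))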

CodePudding user response:

Is there any specific reason you want to use RDD functions instead of the DataFrame ones? In your case it would be easier to write:

from pyspark.sql import functions as F

# importing the functions module as F avoids shadowing Python's built-in sum()
df.groupBy("Category").agg(F.sum("Price"), F.sum("Quantity"))
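
And if the goal is the average item cost per category (taking that to mean total price divided by total quantity, to match the (Category, sum(Price), sum(Quantity)) shape you describe), the same aggregation gives it directly, for example:

from pyspark.sql import functions as F

# average cost per category = total price / total quantity
# (column names "Category", "Price", "Quantity" taken from the question)
df.groupBy("Category").agg(
    (F.sum("Price") / F.sum("Quantity")).alias("avg_cost")
)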

CodePudding user response:

It seems like

priceAndQuantity = df.rdd.map(lambda x: (x[0], (x[1], x[2])))  # (Category, (Price, Quantity))
PAQ1 = priceAndQuantity.reduceByKey(lambda v1, v2: (v1[0] + v2[0], v1[1] + v2[1]))  # sum per category
PAQ2 = PAQ1.map(lambda x: (x[0], x[1][0], x[1][1]))  # flatten to (Category, sum(Price), sum(Quantity))

does what I was hoping to do.
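
And from PAQ1, the average cost I was originally after falls out with one more step (assuming average cost = total price / total quantity):

# divide summed price by summed quantity per category
avgCost = PAQ1.mapValues(lambda v: v[0] / v[1])
# avgCost is an RDD of (Category, average_cost) pairs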
