Spark how to map


My RDD data looks like this:

['101 We no longer have to go to a store during limited hours , stalk the aisles looking for a product , and then wait in check-out lines',
 '102 Now with the click of a button, we have the freedom to shop for anything , anywhere , and at any time . ',
 '103 Every day is Christmas if you buy yourself stuff online .']

I want to build an association table like this: the key is each word in the string, and the value is the number at the start of that string.

So after mapping, I hope to get a result like this:

[(We,101),(no,101),(longer,101),(have,101),(to,101),(go,101), ... ,(lines,101),
 (Now,102),(with,102), ... ,(time,102)
]

But I don't know how to write the map to get that result.

What I tried:

associateRDD = rdd.map(lambda line:(line.split(" "),line.split(" ")[0]))

The result I get now looks like this:

[(['We','no','longer','have','to','go','to','a','store','during','limited','hours',',','stalk','the','aisles'],'101'),
(['Now','with',........,'time'],'102'),(['Every',.....'online'],'103')]

I don't know how to emit each element of the list as its own (word, number) pair.

Could someone help me, please? Thanks.

CodePudding user response:

Assuming your RDD is called table, you can get your desired output with the following statements:

table.map(str =>
  (
    // get everything after the first whitespace (the text) and split it into words
    str.substring(str.indexOf(' ') + 1).split(" "),
    // get the leading number
    str.split(" ")(0)
  )) // at this point each element looks like ([We, no, longer, have, to, ...], 101)
  .flatMap(data => {
    data._1.map(elem => (elem, data._2)) // turn that into (We, 101), (no, 101), (longer, 101), etc.
  }) // flatMap because we do not want an Array of Arrays, but one flat collection of pairs

The final output is an RDD[(String, String)], which is what you need, and if you run a foreach on it, you get:

(We,101)
(Every,103)
(day,103)
(is,103)
(Christmas,103)
(if,103)
(you,103)
...

Good luck!

PySpark version

dummyData = [
    "101 We no longer have to go to a store during limited hours , stalk the aisles looking for a product , and then wait in check-out lines",
    "102 Now with the click of a button, we have the freedom to shop for anything , anywhere , and at any time . ",
    "103 Every day is Christmas if you buy yourself stuff online ."]
data = sc.parallelize(dummyData)

table = data.map(lambda line: (
    line.split(" ", 1)[1].split(" "),  # everything after the first space, split into words
    line.split(" ")[0]                 # the leading number
)).flatMap(lambda pair: (
    map(lambda word: (word, pair[1]), pair[0])  # pair each word with its line's number
))
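
To check the result on the driver (fine for a sample this small), collect it and print the pairs. The same pairs can also be produced with a single flatMap that splits each line only once; this is just an alternative sketch, and the explode helper is an illustrative name, not something from the answer above:

for pair in table.collect():
    print(pair)  # ('We', '101'), ('no', '101'), ('longer', '101'), ...

# Alternative sketch: one flatMap, splitting off the number once per line
def explode(line):
    number, text = line.split(" ", 1)
    return [(word, number) for word in text.split(" ") if word]

associateRDD = data.flatMap(explode)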