My RDD data looks like this:
['101 We no longer have to go to a store during limited hours , stalk the aisles
looking for a product , and then wait in check-out lines',
'102 Now with the click of a button, we have the freedom to shop for anything ,
anywhere , and at any time . ',
'103 Every day is Christmas if you buy yourself stuff online .']
I want to build an association table from it: each word in a string is a key, and the value is the number at the start of that string.
After mapping, I hope to get a result like this:
[(We,101),(no,101),(longer,101),(have,101),(to,101),(go,101),.........,(lines,101),
(Now,102),(with,102),.................................................,(time,102)
]
But I don't know how to write the map to get that result.
This is what I tried:
associateRDD = rdd.map(lambda line:(line.split(" "),line.split(" ")[0]))
The result I get now looks like this:
[(['We','no','longer','have','to','go','to','a','store','during','limited','hours',',','stalk','the','aisles'],'101'),
(['Now','with',........,'time'],'102'),(['Every',.....'online'],'103')]
I don't know how to emit each element of the list as its own pair.
Could someone help me, please? Thanks.
CodePudding user response:
Assuming your table is called table, you can get the desired output with the following statements:
table.map(str =>
  (
    // get everything after the first whitespace (the text)
    str.substring(str.indexOf(' ') + 1).split(" "),
    // get the first number
    str.split(" ")(0)
  )) // the output of this part will be ([We, no, longer, have, to, ...], 101)
  .flatMap(data => {
    data._1.map(elem => (elem, data._2)) // map the above to (We, 101), (no, 101), (longer, 101), etc.
  }) // flatMap because we do not want an Array of Arrays, but only one Array with everything inside
The final output is an Array[(String, String)], which is what you need. If you add a foreach at the bottom, you get:
(We,101)
(Every,103)
(day,103)
(is,103)
(Christmas,103)
(if,103)
(you,103)
...
Good luck!
PySpark version
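from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the existing context when run in a shell

dummyData = [
    "101 We no longer have to go to a store during limited hours , stalk the aisles looking for a product , and then wait in check-out lines",
    "102 Now with the click of a button, we have the freedom to shop for anything , anywhere , and at any time . ",
    "103 Every day is Christmas if you buy yourself stuff online ."]
data = sc.parallelize(dummyData)
# Pair each line's word list with its leading number, then flatten to (word, number) tuples.
table = data.map(lambda line: (
    line.split(" ", 1)[1].split(" "),  # everything after the first space, split into words
    line.split(" ")[0]                 # the leading number
)).flatMap(lambda pair: [(word, pair[1]) for word in pair[0]])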
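The same mapping can also be written as one named function, which may be easier to test than nested lambdas. The following is only a sketch: line_to_pairs is a hypothetical helper name, the sample lines are shortened, and itertools.chain is used as a local stand-in for flatMap so the logic can be checked without a Spark context. It also assumes that splitting on any whitespace (which drops empty tokens) is acceptable, which differs slightly from split(" ") above.

```python
from itertools import chain


def line_to_pairs(line):
    """Turn '101 We no longer' into [('We', '101'), ('no', '101'), ('longer', '101')]."""
    parts = line.split()  # split on any whitespace, dropping empty tokens
    if not parts:         # blank line: nothing to emit
        return []
    line_id, words = parts[0], parts[1:]
    return [(word, line_id) for word in words]


# Shortened sample lines (assumption: same "<number> <text>" layout as the question).
lines = [
    "101 We no longer have to go",
    "102 Now with the click",
]

# Local stand-in for flatMap: apply the function to each line, then flatten one level.
pairs = list(chain.from_iterable(line_to_pairs(line) for line in lines))
print(pairs)
```

With an actual RDD, the equivalent call would be rdd.flatMap(line_to_pairs), since flatMap accepts any function that returns an iterable per input element.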