how to create key-value pairs from an RDD in PySpark


I have come across this link: Spark Scala Array of String lines to pairRDD. As I am new to PySpark, can someone help me understand how the Scala code below can be written in PySpark?

rdd.map(_.split(", "))
  .flatMap(x =>  x.tail.grouped(2).map(y => (x.head, y.head)))

CodePudding user response:

An explanation of the Scala code can be found at that link. The Python code below pairs each row key with every remaining element of its line and produces the output shown:

cq="""Row-Key-001, K1, 10, A2, 20, K3, 30, B4, 42, K5, 19, C20, 20
Row-Key-002, X1, 20, Y6, 10, Z15, 35, X16, 42
Row-Key-003, L4, 30, M10, 5, N12, 38, O14, 41, P13, 8"""
with open("path of file","w") as r:
  r.write(cq) # writing data into file
  r.close()
v=sc.textFile("path of file") # reading above data as RDD
v.map(lambda x : x.split(", ")).flatMap(lambda x:[(x[0],a) for a in x[1:]]).collect()

#output
[('Row-Key-001', 'K1'),
 ('Row-Key-001', '10'),
 ('Row-Key-001', 'A2'),
 ('Row-Key-001', '20'),
 ('Row-Key-001', 'K3'),
 ('Row-Key-001', '30'),
 ('Row-Key-001', 'B4'),
 ('Row-Key-001', '42'),
 ('Row-Key-001', 'K5'),
 ('Row-Key-001', '19'),
 ('Row-Key-001', 'C20'),
 ('Row-Key-001', '20'),
 ('Row-Key-002', 'X1'),
 ('Row-Key-002', '20'),
 ('Row-Key-002', 'Y6'),
 ('Row-Key-002', '10'),
 ('Row-Key-002', 'Z15'),
 ('Row-Key-002', '35'),
 ('Row-Key-002', 'X16'),
 ('Row-Key-002', '42'),
 ('Row-Key-003', 'L4'),
 ('Row-Key-003', '30'),
 ('Row-Key-003', 'M10'),
 ('Row-Key-003', '5'),
 ('Row-Key-003', 'N12'),
 ('Row-Key-003', '38'),
 ('Row-Key-003', 'O14'),
 ('Row-Key-003', '41'),
 ('Row-Key-003', 'P13'),
 ('Row-Key-003', '8')]

First, I create a list for each line using the split method.

Then I map the first element of that list (the row key) onto every other element, and use flatMap instead of map so the result is one flat list of pairs rather than a list of sublists.
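
To make those two steps concrete, here is what a single line of the sample data turns into at each stage (plain Python, no Spark needed; the line is just the second row from the sample above):

line = "Row-Key-002, X1, 20, Y6, 10, Z15, 35, X16, 42"
parts = line.split(", ")   # ['Row-Key-002', 'X1', '20', 'Y6', '10', 'Z15', '35', 'X16', '42']
pairs = [(parts[0], a) for a in parts[1:]]  # pair the row key with every remaining element
print(pairs[:3])  # [('Row-Key-002', 'X1'), ('Row-Key-002', '20'), ('Row-Key-002', 'Y6')]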

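One caveat: the Scala snippet in the question keeps only the first element of each two-element group (y.head from grouped(2)), so its output would drop the numeric values. If you want to mirror that behaviour exactly, a step-of-2 slice over the tail should do it; a minimal sketch against the same RDD v as above:

v.map(lambda x: x.split(", ")) \
 .flatMap(lambda x: [(x[0], a) for a in x[1::2]]) \
 .collect()
# [('Row-Key-001', 'K1'), ('Row-Key-001', 'A2'), ('Row-Key-001', 'K3'), ...]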