I came across this link: Spark Scala Array of String lines to pairRDD. As I am new to PySpark, can someone help me understand how the code below can be written in PySpark?
rdd.map(_.split(", "))
.flatMap(x => x.tail.grouped(2).map(y => (x.head, y.head)))
CodePudding user response:
An explanation of the Scala code can be found at that link; the Python code below will give you the same output.
cq="""Row-Key-001, K1, 10, A2, 20, K3, 30, B4, 42, K5, 19, C20, 20
Row-Key-002, X1, 20, Y6, 10, Z15, 35, X16, 42
Row-Key-003, L4, 30, M10, 5, N12, 38, O14, 41, P13, 8"""
with open("path of file","w") as r:
r.write(cq) # writing data into file
r.close()
v = sc.textFile("path of file")  # read the data above back in as an RDD
# x[1::2] takes every second element after the row key, matching Scala's tail.grouped(2).map(y => y.head)
v.map(lambda x: x.split(", ")).flatMap(lambda x: [(x[0], a) for a in x[1::2]]).collect()
#output
[('Row-Key-001', 'K1'),
 ('Row-Key-001', 'A2'),
 ('Row-Key-001', 'K3'),
 ('Row-Key-001', 'B4'),
 ('Row-Key-001', 'K5'),
 ('Row-Key-001', 'C20'),
 ('Row-Key-002', 'X1'),
 ('Row-Key-002', 'Y6'),
 ('Row-Key-002', 'Z15'),
 ('Row-Key-002', 'X16'),
 ('Row-Key-003', 'L4'),
 ('Row-Key-003', 'M10'),
 ('Row-Key-003', 'N12'),
 ('Row-Key-003', 'O14'),
 ('Row-Key-003', 'P13')]
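If you only want to experiment, you can skip the temporary file entirely. A minimal sketch, assuming an active SparkContext named sc and the cq string from above, builds the RDD directly in memory with parallelize:
v = sc.parallelize(cq.split("\n"))  # one RDD element per line, no file needed
v.map(lambda x: x.split(", ")).flatMap(lambda x: [(x[0], a) for a in x[1::2]]).collect()  # same pairs as above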
First, I create a list for each line using the split method.
Then I pair the first element of that list with every second element after it (the column names), using flatMap to get a single flat list of tuples instead of a list of sublists. A plain-Python illustration of that slicing step follows below.
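Here is the slicing step in isolation, as plain Python with no Spark involved, using one of the sample lines from above:
x = "Row-Key-002, X1, 20, Y6, 10, Z15, 35, X16, 42".split(", ")
x[1::2]  # -> ['X1', 'Y6', 'Z15', 'X16']
[(x[0], a) for a in x[1::2]]
# -> [('Row-Key-002', 'X1'), ('Row-Key-002', 'Y6'), ('Row-Key-002', 'Z15'), ('Row-Key-002', 'X16')]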