I need to read the lines below, which contain comma-separated values, and generate a key-value pair RDD as shown in the output. I am new to Spark, so any guidance is appreciated.
Input:
R-001, A1, 10, A2, 20, A3, 30
R-002, X1, 20, Y2, 10
R-003, Z4, 30, Z10, 5, N12, 38
Output:
R-001, A1
R-001, A2
R-001, A3
R-002, X1
R-002, Y2
R-003, Z4
R-003, Z10
R-003, N12
Code:
lines = spark.parallelize([
"R-001, A1, 10, A2, 20, A3, 30",
"R-002, X1, 20, Y2, 10",
"R-003, Z4, 30, Z10, 5, N12, 38"])
CodePudding user response:
You can flatMap over the lines RDD and, for each line, take the first token as the key and pair it with every second token after it, splitting the line on the comma (,).
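from typing import Tuple, List

# Assumes `spark` is a SparkContext; with a SparkSession,
# call spark.sparkContext.parallelize instead.
lines = spark.parallelize([
    "R-001, A1, 10, A2, 20, A3, 30",
    "R-002, X1, 20, Y2, 10",
    "R-003, Z4, 30, Z10, 5, N12, 38"])

def processor(line: str) -> List[Tuple[str, str]]:
    tokens = line.split(",")
    # The first token is the record id, used as the key.
    key = tokens[0].strip()
    # tokens[1::2] keeps the names (A1, A2, ...) and skips the numbers.
    return [(key, v.strip()) for v in tokens[1::2]]

lines.flatMap(processor).collect()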
Output:
[('R-001', 'A1'),
('R-001', 'A2'),
('R-001', 'A3'),
('R-002', 'X1'),
('R-002', 'Y2'),
('R-003', 'Z4'),
('R-003', 'Z10'),
('R-003', 'N12')]
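If it helps to check the splitting logic without a Spark installation, here is a minimal sketch in plain Python. It assumes the same three input lines; explode_line is a hypothetical name that mirrors processor above, and itertools.chain.from_iterable stands in for flatMap (map each line to a list, then flatten one level). It only illustrates the idea and is not a replacement for the RDD code.

```python
from itertools import chain
from typing import List, Tuple


def explode_line(line: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for `processor`: one (key, name) pair per name."""
    tokens = [t.strip() for t in line.split(",")]
    key = tokens[0]
    # Every second token starting at index 1 is a name; the rest are numbers.
    return [(key, name) for name in tokens[1::2]]


rows = [
    "R-001, A1, 10, A2, 20, A3, 30",
    "R-002, X1, 20, Y2, 10",
    "R-003, Z4, 30, Z10, 5, N12, 38",
]

# Map each row to a list of pairs, then flatten the lists into one sequence.
pairs = list(chain.from_iterable(explode_line(r) for r in rows))

for key, name in pairs:
    print(f"{key}, {name}")
```

A line that contains only the key yields an empty list, so it simply contributes no pairs after flattening.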