PySpark: generate key-value pair RDD from comma-separated lines

I need to read the lines below, which contain comma-separated values, and generate a key-value pair RDD as shown in the output. I am new to Spark, so any guidance is appreciated.

Input:

    R-001, A1, 10, A2, 20, A3, 30

    R-002, X1, 20, Y2, 10

    R-003, Z4, 30, Z10, 5, N12, 38

Output:

    R-001, A1
    R-001, A2
    R-001, A3
    R-002, X1
    R-002, Y2
    R-003, Z4
    R-003, Z10
    R-003, N12

Code:

    lines = spark.parallelize([
    "R-001, A1, 10, A2, 20, A3, 30",
    "R-002, X1, 20, Y2, 10",
    "R-003, Z4, 30, Z10, 5, N12, 38"])

Answer:

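You can flatMap over the lines RDD and, for each line, split on commas to extract the key and the values. The key is the first token, and the values are every second token after it (indices 1, 3, 5, ...), which skips the numbers in between.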
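To make the slicing idea concrete, here is a small plain-Python sketch on one of the input lines (no Spark needed). The variable names are only for illustration.

```python
line = "R-001, A1, 10, A2, 20, A3, 30"

# Split on commas and trim the whitespace around each token.
tokens = [t.strip() for t in line.split(",")]

key = tokens[0]          # the record id, 'R-001'
values = tokens[1::2]    # every second token starting at index 1
numbers = tokens[2::2]   # the tokens in between, which we want to skip

pairs = [(key, v) for v in values]
print(pairs)  # [('R-001', 'A1'), ('R-001', 'A2'), ('R-001', 'A3')]
```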

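    from typing import Tuple, List

    # Assumes `spark` is a SparkSession; parallelize lives on its SparkContext.
    lines = spark.sparkContext.parallelize([
        "R-001, A1, 10, A2, 20, A3, 30",
        "R-002, X1, 20, Y2, 10",
        "R-003, Z4, 30, Z10, 5, N12, 38"])

    def processor(line: str) -> List[Tuple[str, str]]:
        tokens = line.split(",")
        # The first token is the key.
        key = tokens[0].strip()
        # Tokens at odd indices (1, 3, 5, ...) are the values; the numbers are skipped.
        return [(key, v.strip()) for v in tokens[1::2]]

    lines.flatMap(processor).collect()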

Output:

    [('R-001', 'A1'),
     ('R-001', 'A2'),
     ('R-001', 'A3'),
     ('R-002', 'X1'),
     ('R-002', 'Y2'),
     ('R-003', 'Z4'),
     ('R-003', 'Z10'),
     ('R-003', 'N12')]
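If you want to sanity-check the parsing logic without a Spark session, a rough local equivalent is sketched below. `to_pairs` is a hypothetical name, and returning nothing for a blank line is my assumption about the desired behaviour, not something stated in the question. `itertools.chain.from_iterable` stands in for `flatMap` here.

```python
from itertools import chain
from typing import List, Tuple

def to_pairs(line: str) -> List[Tuple[str, str]]:
    # Trim every token up front.
    tokens = [t.strip() for t in line.split(",")]
    # A blank line has an empty first token; emit no pairs for it (assumed behaviour).
    if not tokens[0]:
        return []
    key = tokens[0]
    return [(key, v) for v in tokens[1::2]]

sample = [
    "R-001, A1, 10, A2, 20, A3, 30",
    "",
    "R-002, X1, 20, Y2, 10",
]

# Flatten the per-line lists into a single list, as flatMap would.
result = list(chain.from_iterable(to_pairs(line) for line in sample))
print(result)
```

The same function could then be passed to the RDD as `lines.flatMap(to_pairs)`.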