I created an RDD from a CSV file with lines_data = sc.textFile(data). Now I need to convert the lines into a key-value RDD, where the value is the string obtained after splitting and the key is the CSV column number. For example, for this CSV:
Col1 | Col2 |
---|---|
73 | 230666 |
55 | 149610 |
I want rdd.take(1) to give: [(1, 73), (2, 230666)]
First I create an RDD of lists:
lines_of_list = lines_data.map(lambda line : line.split(','))
Then I create a function that takes a list and returns a list of (key, value) tuples:
def list_of_tuple(l):
    list_tup = []
    for i in range(len(l[0])):
        list_tup.append((l[0][i], i))
    return list_tup
But I can't get the correct result when I map this function over the RDD.
CodePudding user response:
You can use PySpark's create_map function to do this, like so:
from pyspark.sql.functions import create_map, col, lit

df = spark.createDataFrame([(73, 230666), (55, 149610)], "Col1: int, Col2: int")
mapped_df = df.select(
    create_map(lit(1), col("Col1")).alias("mappedCol1"),
    create_map(lit(2), col("Col2")).alias("mappedCol2"),
)
mapped_df.show()
+----------+-------------+
|mappedCol1|   mappedCol2|
+----------+-------------+
| {1 -> 73}|{2 -> 230666}|
| {1 -> 55}|{2 -> 149610}|
+----------+-------------+
If you still want to use the RDD API, it is available as a property of the DataFrame, so you can use it like so:
mapped_df.rdd.take(1)
Out[32]: [Row(mappedCol1={1: 73}, mappedCol2={2: 230666})]
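If you need the flat (key, value) tuples from the question rather than Row objects, one way is to flatMap each row's map columns into their items. This is a minimal sketch, assuming the mapped_df built above:

pairs_rdd = mapped_df.rdd.flatMap(
    lambda row: list(row.mappedCol1.items()) + list(row.mappedCol2.items())
)
pairs_rdd.take(2)  # [(1, 73), (2, 230666)]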
CodePudding user response:
I fixed the problem in this way:
def list_of_tuple(line_rdd):
    # split the CSV line into fields, then pair each field with its column index
    l = line_rdd.split(',')
    list_tup = []
    for i in range(len(l)):
        list_tup.append((l[i], i))
    return list_tup
pairs_rdd = lines_data.map(list_of_tuple)
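Note that this emits (value, 0-based index) pairs within each line's list, not the (column number, value) pairs from the question's expected output. A shorter sketch that produces one flat pair RDD in the requested order, assuming the same lines_data, uses enumerate with flatMap:

pairs_rdd = lines_data.flatMap(
    lambda line: [(i + 1, v) for i, v in enumerate(line.split(','))]
)
pairs_rdd.take(2)  # [(1, '73'), (2, '230666')] for the first CSV line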