I created an RDD from a CSV file with lines_data = sc.textFile(data). Now I need to convert the lines into a key-value RDD, where the value is the string obtained after splitting and the key is the CSV column number. For example, for this CSV:
Col1 | Col2 |
---|---|
73 | 230666 |
55 | 149610 |
I want rdd.take(1) to give: [(1, 73), (2, 230666)]
First I create an RDD of lists:
lines_of_list = lines_data.map(lambda line : line.split(','))
Then I create a function that takes a list and returns a list of (key, value) tuples:
def list_of_tuple(l):
    list_tup = []
    for i in range(len(l[0])):
        list_tup.append((l[0][i], i))
    return list_tup
But I can't get the correct result when I map this function over the RDD.
CodePudding user response:
You can use PySpark's create_map function to do this, like so:
from pyspark.sql.functions import create_map, col, lit

df = spark.createDataFrame([(73, 230666), (55, 149610)], "Col1: int, Col2: int")
mapped_df = df.select(
    create_map(lit(1), col("Col1")).alias("mappedCol1"),
    create_map(lit(2), col("Col2")).alias("mappedCol2"),
)
mapped_df.show()
+----------+-------------+
|mappedCol1|   mappedCol2|
+----------+-------------+
| {1 -> 73}|{2 -> 230666}|
| {1 -> 55}|{2 -> 149610}|
+----------+-------------+
If you still want to use the RDD API, it is available as a property of the DataFrame, so you can use it like so:
mapped_df.rdd.take(1)
Out[32]: [Row(mappedCol1={1: 73}, mappedCol2={2: 230666})]
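If you need the flat (key, value) tuples from the question rather than Row objects, one way is to flatMap each row's map columns into their items. This is a minimal sketch, assuming the mapped_df built above:

pairs_rdd = mapped_df.rdd.flatMap(
    lambda row: list(row.mappedCol1.items()) + list(row.mappedCol2.items())
)
pairs_rdd.take(2)  # [(1, 73), (2, 230666)]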
CodePudding user response:
I fixed the problem in this way:
def list_of_tuple(line_rdd):
    # split the CSV line into fields, then pair each field with its column index
    l = line_rdd.split(',')
    list_tup = []
    for i in range(len(l)):
        list_tup.append((l[i], i))
    return list_tup
pairs_rdd = lines_data.map(list_of_tuple)
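Note that this emits (value, 0-based index) pairs within each line's list, not the (column number, value) pairs from the question's expected output. A shorter sketch that produces one flat pair RDD in the requested order, assuming the same lines_data, uses enumerate with flatMap:

pairs_rdd = lines_data.flatMap(
    lambda line: [(i + 1, v) for i, v in enumerate(line.split(','))]
)
pairs_rdd.take(2)  # [(1, '73'), (2, '230666')] for the first CSV line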