I'm new to Spark and have a simple question. I want to use PrefixSpan, but it only supports Dataset and DataFrame input, so I convert my list to an RDD and then convert that to a DataFrame. Why does the list have to be converted to an RDD first? Why can't a list be converted directly to a DataFrame?
from pyspark.sql import Row
data = [Row([[1, 2], [3]]), Row([[1], [3, 2], [2]]), Row([[1, 2], [5]]), Row([[6]])]
columns = ["seq"]
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(data=data).toDF(*columns)
Thanks.
CodePudding user response:
No, you don't need to create an RDD first.
A DataFrame is an abstraction on top of an RDD. You can create a DataFrame from an RDD, or directly from a local list:
df = spark.createDataFrame(
[
(1, "foo"), # create your data here, be consistent in the types.
(2, "bar"),
],
["id", "label"] # add your column names here
)
Regardless of how you created the DataFrame, it will still have a .rdd member.
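To make the equivalence concrete, here is a minimal sketch that builds the DataFrame both ways. The helper name build_both_ways is made up for illustration, and it assumes you pass in an active SparkSession as spark.

```python
def build_both_ways(spark, rows, columns):
    """Build a DataFrame directly from a list and, for comparison, via an RDD.

    `spark` is assumed to be an active SparkSession (hypothetical helper).
    """
    # Direct route: hand the local Python list straight to createDataFrame.
    direct_df = spark.createDataFrame(rows, schema=columns)

    # RDD route: the explicit parallelize step is optional.
    rdd = spark.sparkContext.parallelize(rows)
    via_rdd_df = spark.createDataFrame(rdd, schema=columns)

    return direct_df, via_rdd_df


# Usage, assuming a local Spark installation:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# direct_df, via_rdd_df = build_both_ways(
#     spark, [(1, "foo"), (2, "bar")], ["id", "label"]
# )
# direct_df.show()
# via_rdd_df.show()
# direct_df.rdd  # the underlying RDD is available either way
```

With a running session, both DataFrames should show the same rows and schema, and direct_df.rdd is available even though parallelize was never called for it.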
CodePudding user response:
I am copying the code given in the question here.
data = [Row([[1, 2], [3]]), Row([[1], [3, 2], [2]]), Row([[1, 2], [5]]), Row([[6]])] # Line 1
columns = ["seq"] # Line 2
rdd = spark.sparkContext.parallelize(data) # Line 3
df = spark.createDataFrame(data).toDF(*columns) # Line 4 (removed .show())
Here, the rdd variable is never used, so you don't need to create it at all. Line 3 can be removed and df will be the same. This is exactly how you create a DataFrame directly from a list. There is another form that you will see very often:
df = spark.createDataFrame(data, schema=columns)
This will also create the same DataFrame.
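For the PrefixSpan use case from the question, a sketch along these lines should work without any RDD step. The helper names to_rows and frequent_patterns are hypothetical, the min_support and max_pattern_length defaults are arbitrary values chosen for illustration, and spark is assumed to be an active SparkSession. Plain one-element tuples are used here in place of Row objects.

```python
def to_rows(sequences):
    """Wrap each sequence in a one-element tuple so it maps to a single column."""
    return [(seq,) for seq in sequences]


def frequent_patterns(spark, sequences, min_support=0.5, max_pattern_length=5):
    """Run PrefixSpan on a local list of sequences (illustrative helper).

    `spark` is assumed to be an active SparkSession; the default parameter
    values are arbitrary and should be tuned for real data.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml.fpm import PrefixSpan

    # No parallelize call: the local list goes straight into createDataFrame.
    df = spark.createDataFrame(to_rows(sequences), schema=["seq"])

    # sequenceCol must name the column that holds the sequences.
    miner = PrefixSpan(
        minSupport=min_support,
        maxPatternLength=max_pattern_length,
        sequenceCol="seq",
    )
    return miner.findFrequentSequentialPatterns(df)


# Usage, assuming a local Spark installation:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# sequences = [[[1, 2], [3]], [[1], [3, 2], [2]], [[1, 2], [5]], [[6]]]
# frequent_patterns(spark, sequences).show()
```

Passing sequenceCol="seq" matters because PrefixSpan otherwise looks for a column named "sequence".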