Why should a list be converted to an RDD and then to a DataFrame? Is there any method to convert a list to a DataFrame?

I'm new to Spark and have a simple question. I want to use the PrefixSpan method, but it only supports Dataset and DataFrame input, so I convert my list to an RDD and then convert that to a DataFrame. Why does the list have to be converted to an RDD first? Why can't a list be converted to a DataFrame directly?

from pyspark.sql import Row  # spark is assumed to be an existing SparkSession

data = [Row([[1, 2], [3]]), Row([[1], [3, 2], [2]]), Row([[1, 2], [5]]), Row([[6]])]
columns = ["seq"]
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(data=data).toDF(*columns)

Thanks.

CodePudding user response:

No, you don't need to create an RDD first.

A DataFrame is an abstraction on top of an RDD. You can create a DataFrame from an RDD, or directly from a local list:

df = spark.createDataFrame(
    [
        (1, "foo"),  # create your data here, be consistent in the types.
        (2, "bar"),
    ],
    ["id", "label"]  # add your column names here
)

Regardless of how you created the DataFrame, it will still have a .rdd member.

CodePudding user response:

I am copying the code given in the question here.

data = [Row([[1, 2], [3]]), Row([[1], [3, 2], [2]]), Row([[1, 2], [5]])]    # Line 1
columns = ["seq"]                                                           # Line 2
rdd = spark.sparkContext.parallelize(data)                                  # Line 3
df = spark.createDataFrame(data).toDF(*columns)                             # Line 4 (removed .show())

The rdd variable is never used, so you don't need to create it at all. Line 3 can be removed and df will be exactly the same. This is already how you create a DataFrame directly from a list. Another form that you will see very often passes the column names as the schema:

df = spark.createDataFrame(data, schema=columns)

This will also create the same DataFrame.