Why list should be converted to RDD and then Dataframe? is there any method to convert list to dataf

Time:09-29

I'm new to Spark and have a simple question. I want to use PrefixSpan, but it only supports Datasets and DataFrames, so I convert my list to an RDD and then convert that to a DataFrame. Why does the list have to be converted to an RDD first? Why can't a list be converted directly to a DataFrame?

from pyspark.sql import Row

data = [Row([[1, 2], [3]]), Row([[1], [3, 2], [2]]), Row([[1, 2], [5]]), Row([[6]])]
columns = ["seq"]
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(data=data).toDF(*columns)

Thanks.

CodePudding user response:

No, you don't need to create an RDD first.

A DataFrame is an abstraction on top of an RDD. You can create a DataFrame from an existing RDD, or directly from a local collection such as a list of tuples:

df = spark.createDataFrame(
    [
        (1, "foo"),  # create your data here, be consistent in the types.
        (2, "bar"),
    ],
    ["id", "label"]  # add your column names here
)

Regardless of how you created the DataFrame, it will still have a .rdd member.
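To make the comparison concrete, here is a minimal sketch that builds the same DataFrame both ways and returns the `.rdd` of the one created directly. The names `RECORDS`, `COLUMNS` and `both_routes` are made up for this illustration, and `spark` is assumed to be an existing SparkSession passed in by the caller.

```python
# Hypothetical sample data, mirroring the snippet above.
RECORDS = [(1, "foo"), (2, "bar")]
COLUMNS = ["id", "label"]


def both_routes(spark):
    """Build the same DataFrame directly and via an RDD.

    `spark` is assumed to be an already created SparkSession.
    """
    # Route 1: straight from the Python list.
    direct_df = spark.createDataFrame(RECORDS, COLUMNS)

    # Route 2: the detour through an RDD, which is optional.
    rdd = spark.sparkContext.parallelize(RECORDS)
    via_rdd_df = spark.createDataFrame(rdd, COLUMNS)

    # The directly created DataFrame still exposes an underlying RDD.
    return direct_df, via_rdd_df, direct_df.rdd
```

Calling `both_routes(spark)` and comparing `sorted(direct_df.collect())` with `sorted(via_rdd_df.collect())` should show the same rows for both routes.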

CodePudding user response:

I am copying the code given in the question here.

data = [Row([[1, 2], [3]]), Row([[1], [3, 2], [2]]), Row([[1, 2], [5]])]    # Line 1
columns = ["seq"]                                                           # Line 2
rdd = spark.sparkContext.parallelize(data)                                  # Line 3
df = spark.createDataFrame(data).toDF(*columns)                             # Line 4 (removed .show())

Here, the rdd variable is never used, so you don't need to create it at all. Line 3 can be removed and df will still be the same: Line 4 already creates the DataFrame directly from the list. There is another way to write this, which you will see very frequently:

df = spark.createDataFrame(data, schema=columns)

This will also create the same DataFrame.
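Since the original goal was PrefixSpan, the following sketch shows how the directly created DataFrame might be passed to it. The helper names `to_records` and `frequent_patterns`, the column name `sequence`, and the default parameter values are assumptions chosen for this example; `spark` is again assumed to be an existing SparkSession. Each sequence is wrapped in a one-element tuple because Spark would otherwise likely treat the outer list as a row with several columns.

```python
def to_records(sequences):
    """Wrap each sequence in a one-element tuple so it fills a single column."""
    return [(seq,) for seq in sequences]


def frequent_patterns(spark, sequences, min_support=0.5, max_pattern_length=5):
    """Run PrefixSpan on a plain Python list, with no explicit RDD step.

    `spark` is assumed to be an already created SparkSession.
    """
    # Imported here so that to_records() can be used without Spark installed.
    from pyspark.ml.fpm import PrefixSpan

    # Create the DataFrame directly from the list of one-element tuples.
    df = spark.createDataFrame(to_records(sequences), schema=["sequence"])

    prefix_span = PrefixSpan(
        minSupport=min_support,
        maxPatternLength=max_pattern_length,
        sequenceCol="sequence",
    )
    # Returns a DataFrame of frequent sequential patterns and their frequencies.
    return prefix_span.findFrequentSequentialPatterns(df)
```

With the data from the question, a call could look like `frequent_patterns(spark, [[[1, 2], [3]], [[1], [3, 2], [2]], [[1, 2], [5]], [[6]]]).show()`.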
