Suppose you have two S3 buckets that you want to read into a Spark DataFrame. For a single location, the read would look like this:
file_1 = "s3://loc1/"
df = spark.read.option("mergeSchema", "true").load(file_1)
If we have two locations:
file_1 = "s3://loc1/"
file_2 = "s3://loc2/"
how would we read them into a single Spark DataFrame? Is there a way to merge those two file locations?
CodePudding user response:
As the previous comment states, you could read each location in individually and then union the resulting DataFrames.
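For example, a minimal sketch of that approach (assuming both locations hold data in a format load can read, and Spark 3.1+ for the allowMissingColumns argument):

df1 = spark.read.option("mergeSchema", "true").load("s3://loc1/")
df2 = spark.read.option("mergeSchema", "true").load("s3://loc2/")

# Match columns by name; allowMissingColumns fills any missing ones with nulls
merged_df = df1.unionByName(df2, allowMissingColumns=True)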
Another option could be to try the Spark RDD API and then convert that into a DataFrame. So for example:
sc = spark.sparkContext
raw_data_RDD = sc.textFile("s3://loc1/,s3://loc2/")
Note that textFile takes a single comma-separated string of paths, not separate arguments.
For nested directories, you can use the wildcard symbol (*). One thing you have to consider is whether the schemas for both locations are equal; since textFile returns an RDD of raw strings, you may also have to do some pre-processing (parsing each line) before converting to a DataFrame. Once your schema is set up, you can just do:
raw_df = spark.createDataFrame(raw_data_RDD, schema=<schema>)
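Putting it together, here is a sketch of the full RDD route. It assumes both locations hold comma-delimited text files sharing the same two columns; the paths, column names, and types below are hypothetical:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext

# textFile takes a comma-separated string of paths
raw_data_RDD = sc.textFile("s3://loc1/,s3://loc2/")

# Pre-processing: split each raw line and cast the second column to int
parsed_RDD = raw_data_RDD.map(lambda line: line.split(",")) \
                         .map(lambda cols: (cols[0], int(cols[1])))

schema = StructType([
    StructField("name", StringType(), True),
    StructField("count", IntegerType(), True),
])

raw_df = spark.createDataFrame(parsed_RDD, schema=schema)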