Suppose you have two S3 buckets that you want to read into a Spark DataFrame. For a single location, the read would look like this:
file_1 = "s3://loc1/"
df = spark.read.option("mergeSchema", "true").load(file_1)
If we have two locations:
file_1 = "s3://loc1/"
file_2 = "s3://loc2/"
how would we read them into a single Spark DataFrame? Is there a way to merge those two file locations?
CodePudding user response:
As the previous comment states, you could read each location in individually and then union the resulting DataFrames.
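For example, a minimal sketch of that approach (assuming both locations hold data in a format load can read, and Spark 3.1+ for the allowMissingColumns argument):

df1 = spark.read.option("mergeSchema", "true").load("s3://loc1/")
df2 = spark.read.option("mergeSchema", "true").load("s3://loc2/")

# Match columns by name; allowMissingColumns fills any missing ones with nulls
merged_df = df1.unionByName(df2, allowMissingColumns=True)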
Another option could be to try the Spark RDD API and then convert that into a DataFrame. So for example:
sc = spark.sparkContext
raw_data_RDD = sc.textFile("s3://loc1/,s3://loc2/")
Note that textFile takes a single comma-separated string of paths, not separate arguments.
For nested directories, you can use the wildcard symbol (*). One thing you have to consider is whether the schemas for both locations are equal; since textFile returns an RDD of raw strings, you may also have to do some pre-processing (parsing each line) before converting to a DataFrame. Once your schema is set up, you can just do:
raw_df = spark.createDataFrame(raw_data_RDD, schema=<schema>)
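Putting it together, here is a sketch of the full RDD route. It assumes both locations hold comma-delimited text files sharing the same two columns; the paths, column names, and types below are hypothetical:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext

# textFile takes a comma-separated string of paths
raw_data_RDD = sc.textFile("s3://loc1/,s3://loc2/")

# Pre-processing: split each raw line and cast the second column to int
parsed_RDD = raw_data_RDD.map(lambda line: line.split(",")) \
                         .map(lambda cols: (cols[0], int(cols[1])))

schema = StructType([
    StructField("name", StringType(), True),
    StructField("count", IntegerType(), True),
])

raw_df = spark.createDataFrame(parsed_RDD, schema=schema)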