Reading multiple directories into multiple spark dataframes


I'm trying to read a list of directories each into its own dataframe. E.g.

dir_list = ['dir1', 'dir2', ...]
df1 = spark.read.csv(dir_list[0])
df2 = spark.read.csv(dir_list[1])
...

Each directory contains data with a different schema.

I want to do this in parallel, so just a simple for loop won't work. Is there any way of doing this?

CodePudding user response:

Yes there is :))

"How" will depend on what kind of processing you do after read, because by itself the spark.read.csv(...) won't execute until you call an action (due to Spark's lazy evaluation) and putting multiple reads in a for loop will work just fine.

So, if the results of evaluating the dataframes have the same schema (even when the source data does not), parallelism can be achieved simply by UNION'ing them. For example,

from pyspark.sql.functions import lit

df1 = spark.read.csv(dir_list[0])
df2 = spark.read.csv(dir_list[1])

# Tag each count with its source so the unioned rows are distinguishable
(df1.withColumn("dfid", lit("df1")).groupBy("dfid").count()
    .union(df2.withColumn("dfid", lit("df2")).groupBy("dfid").count())
    .show(truncate=False))

... will cause both dir_list[0] and dir_list[1] to be read in parallel.

If this is not feasible, there is always the Spark scheduling route. From the Spark documentation on job scheduling:

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action.
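A minimal sketch of that route, assuming the same dir_list and an active spark session (the helper name read_and_count is illustrative):

from concurrent.futures import ThreadPoolExecutor

def read_and_count(path):
    df = spark.read.csv(path)
    # count() is the action: each call submits its own Spark job
    return path, df, df.count()

# Each thread submits an independent job, so Spark can run them simultaneously
with ThreadPoolExecutor(max_workers=len(dir_list)) as pool:
    results = list(pool.map(read_and_count, dir_list))

dfs = {path: df for path, df, count in results}

Since every thread triggers its own action on its own dataframe, this works even when the directories have completely different schemas.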
