How to load data from multiple subfolders in datalake to a dataframe in azure databricks


I have a few csv files that sit in multiple subfolders in Azure Data Lake. I want to take all the data in those files and load it into a single dataframe in Azure Databricks. The folder and subfolder structure is given below:

Folder name -> YearName2019 -> Month1  -> filename
                            -> Month2  -> filename
                               .
                            -> Month12 -> filename

            -> YearName2020 -> Month1  -> filename
                               .
                               .
                            -> Month12 -> filename

I am trying to read the data from all the subfolders using the following code, but it is not working:

df = spark.read.load('/FolderName/*', format='csv', sep=',', header='True', inferSchema=True)

CodePudding user response:

You can first gather all the paths to load, e.g. build a list of paths listOfPaths, and then pass it as an argument to the .load method. Note that a single * in the path only matches one directory level (here, the year folders), so the csv files sitting two levels below FolderName are never picked up by your current code.

Take a look at an example below:

val listOfPaths: List[String] = List(
  "Folder name/YearName2019/Month1/filename",
  "Folder name/YearName2019/Month2/filename"
)

// Pass the whole list as varargs to load(); all files are read into one DataFrame
val dataDf = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(listOfPaths: _*)
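
Since the code in the question is PySpark, here is a rough Python equivalent of the same idea. It is only a sketch; the paths below are placeholders following the folder layout described in the question, so replace them with the real paths in your data lake.

# Placeholder paths following the layout in the question; replace with the real ones
list_of_paths = [
    "/FolderName/YearName2019/Month1/filename",
    "/FolderName/YearName2019/Month2/filename",
]

# DataFrameReader.load also accepts a list of paths in PySpark,
# so every file in the list ends up in a single dataframe
df = spark.read.load(
    list_of_paths,
    format="csv",
    sep=",",
    header=True,
    inferSchema=True,
)

If the layout is always exactly two levels deep, a multi-level wildcard such as '/FolderName/*/*/*' should also work, because each * in the path matches a single directory level.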

You may also be interested in other Spark-related topics; I encourage you to visit my blog: https://bigdata-etl.com/articles/big-data/apache-spark/
