I have a few CSV files present in multiple subfolders in Azure Data Lake. I want to take all the data in those files and load it into a DataFrame in Azure Databricks. The folder and subfolder structure in Databricks is given below:
Folder name -> YearName2019 -> Month1  -> filename
                            -> Month2  -> filename
                            ...
                            -> Month12 -> filename
            -> YearName2020 -> Month1  -> filename
                            ...
                            -> Month12 -> filename
I am trying to read the data from all the subfolders using the following code, but it's not working:
df = spark.read.load('/FolderName/*', format='csv', sep=',', header='True', inferSchema=True)
CodePudding user response:
You can first gather all the paths you want to load, e.g. build a list of paths listOfPaths, and then pass it as an argument to the .load method. Take a look at the example below:
// Gather all the file paths you want to read
val listOfPaths: List[String] = List(
  "Folder name/YearName2019/Month1/filename",
  "Folder name/YearName2019/Month2/filename"
)

// Pass the whole list to .load using varargs expansion (: _*)
val dataDf = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(listOfPaths: _*)
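Since your original snippet is in PySpark, here is a minimal sketch of the same idea in Python. The paths are placeholders for your actual file names in the data lake, and the nested-wildcard variant at the end assumes every file sits exactly two folder levels below the top folder, as in your layout.

# Minimal PySpark sketch of the same approach
# (the listed paths are placeholders for your real file names)
list_of_paths = [
    "/FolderName/YearName2019/Month1/filename",
    "/FolderName/YearName2019/Month2/filename",
]

# spark.read.csv accepts a list of paths directly
df = (spark.read
      .option("sep", ",")
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(list_of_paths))

# Alternatively, a nested wildcard should match files two levels deep:
# df = spark.read.csv("/FolderName/*/*/", sep=",", header=True, inferSchema=True)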
You might also be interested in other Spark-related topics; feel free to visit my blog: https://bigdata-etl.com/articles/big-data/apache-spark/