How to read data from multiple folders in ADLS into a Databricks DataFrame

The file path format is data/year/week number/day number/hour/data_hour.parquet, for example:

data/2022/05/01/00/data_00.parquet

data/2022/05/01/01/data_01.parquet

data/2022/05/01/02/data_02.parquet

data/2022/05/01/03/data_03.parquet

data/2022/05/01/04/data_04.parquet

data/2022/05/01/05/data_05.parquet

data/2022/05/01/06/data_06.parquet

data/2022/05/01/07/data_07.parquet

How can I read all of these files one by one in a Databricks notebook and store them in a single DataFrame? This is what I have tried:

import pandas as pd 

#Get all the files under the folder
data = dbutils.fs.ls(file)

df = pd.DataFrame(data)

#Create the list of file paths
paths = df.path.tolist()

#Try to read each path in turn
for p in paths:
    df = spark.read.load(path=f'{p}*',format='parquet')

I am only able to read the last file; all the other files are skipped.

CodePudding user response：

The last line of your code does not load data incrementally. Instead, it overwrites the df variable with the data from the current path on every iteration, so after the loop finishes df holds only the data from the last path.
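If you do want to read path by path, the results have to be accumulated rather than reassigned. Here is a minimal sketch of that idea, assuming `spark` is the session a Databricks notebook provides, `paths` is a list of path strings, and all files share the same columns. The helper name `read_and_union` is hypothetical, not part of any library.

```python
from functools import reduce


def read_and_union(spark, paths):
    """Read every path as Parquet and union the results into one DataFrame."""
    # Keep one DataFrame per path instead of reassigning a single variable.
    frames = [spark.read.load(path=p, format="parquet") for p in paths]
    if not frames:
        raise ValueError("no paths to read")
    # unionByName matches columns by name rather than by position.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

In a notebook this would be called as `df = read_and_union(spark, paths)`. It works, but a single wildcard read is usually simpler.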

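Remove the for loop and try the code below, which shows how file masking with asterisks (wildcards) works: each * matches any folder or file name at that level, so one call reads every matching file into a single DataFrame. Note that the path should be a full path. (I'm not sure whether the data folder is your root folder.)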

df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')

This is the approach from the answer I shared with you in the comments, applied to your folder layout.
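If you only need some of the hour folders rather than everything under a week, one option is to build the paths explicitly and pass them all to a single read. The sketch below assumes the year/week/day/hour layout from the question and uses `/data` as the root only as a placeholder. `hour_paths` and `read_hours` are hypothetical helper names.

```python
def hour_paths(root, year, week, day, hours):
    """Build one wildcard path per hour folder under root/year/week/day."""
    return [
        f"{root}/{year:04d}/{week:02d}/{day:02d}/{h:02d}/*.parquet"
        for h in hours
    ]


def read_hours(spark, root, year, week, day, hours):
    """Read the selected hour folders into a single DataFrame."""
    # DataFrameReader.parquet accepts several paths in one call.
    return spark.read.parquet(*hour_paths(root, year, week, day, hours))
```

For example, `read_hours(spark, "/data", 2022, 5, 1, range(8))` would target the eight hour folders listed in the question, provided `/data` really is the full path to the data folder.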

Kindly accept my answer if it works. Thanks!