Filter files by size threshold when reading with PySpark or Python


I have n ORC files in a path, and around 150 of them are null or incomplete in size. I want to ignore those while reading through PySpark. I have written the following, but I need some help, as it's not working.

import os

path = "/home/data/raw_data/"
file_list = os.listdir(path)
for file in file_list:
    size = os.path.getsize(os.path.join(path, file))
    if size > 6500:  # only want to import files larger than 6.5 MB
        file_list.append(size)
raw_df = spark.read.format("orc").load(path)

CodePudding user response:

The issues in the above code: file_list.append(size) is not required (it appends sizes to the list of filenames you are iterating over), and the Spark read should happen inside the loop, once per file that passes the size check.

import os
from functools import reduce

from pyspark.sql import DataFrame

df_list = []
path = "/home/data/raw_data/"
file_list = os.listdir(path)
for file in file_list:
    size = os.path.getsize(os.path.join(path, file))
    if size > 6500:
        # read each qualifying file individually
        raw_df = spark.read.format("orc").load(os.path.join(path, file))
        df_list.append(raw_df)

# combine the per-file DataFrames into a single DataFrame
df_fnl = reduce(DataFrame.unionByName, df_list)
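
Note that DataFrame.unionByName matches columns by name, so every file read this way needs the same schema; on Spark 3.1+ you can pass allowMissingColumns=True if some files lack columns, and the missing values are filled with nulls.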

Kindly upvote if you like my solution.

CodePudding user response:

The number of files can be very large, making the loop inefficient. An alternative is to load all the files at once and then filter down to only the files you need. You can see which file each row came from with the function input_file_name().

Then, if you have a DataFrame of all the filenames you need, you can inner join it with the loaded data on the input_file_name value, and only entries from the required files are kept.
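
A minimal sketch of that approach, assuming the local path and 6500-byte threshold from the question; the file:// prefix is an assumption matching how input_file_name() reports files on a local filesystem (it would differ for HDFS or S3):

import os

from pyspark.sql import functions as F

path = "/home/data/raw_data/"

# build the list of files above the size threshold, as fully
# qualified URIs so they match the output of input_file_name()
wanted = [
    ("file://" + os.path.join(path, f),)
    for f in os.listdir(path)
    if os.path.getsize(os.path.join(path, f)) > 6500
]

# load everything once and tag each row with its source file
raw_df = (
    spark.read.format("orc")
    .load(path)
    .withColumn("source_file", F.input_file_name())
)

# helper DataFrame of required filenames; the inner join keeps
# only rows that came from those files
files_df = spark.createDataFrame(wanted, ["source_file"])
df_fnl = raw_df.join(files_df, on="source_file", how="inner").drop("source_file")

If the null/incomplete files cause the directory-level load itself to fail, setting spark.sql.files.ignoreCorruptFiles to true before reading is one way to have Spark skip them.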
