I want to ignore the paths that generate the error
'Path does not exist'
when I read parquet files with PySpark. For example, I have a list of paths:
list_paths = ['path1', 'path2', 'path3']
and read the files like:
dataframe = spark.read.parquet(*list_paths)
but the path path2 does not exist. In general, I do not know in advance which paths do not exist, so I want to skip path2 automatically. How can I do this and obtain a single dataframe?
CodePudding user response:
Maybe you can do:

import os

# keep only the paths that exist on the local filesystem
existing_paths = [path for path in list_paths if os.path.exists(path)]
dataframe = spark.read.parquet(*existing_paths)
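Note that os.path.exists only checks the local filesystem, so this works when the parquet files are on local disk but not for paths on HDFS or S3.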
CodePudding user response:
You can use the Hadoop FS API to check that the files exist before you pass them to spark.read:
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

# keep only the paths whose filesystem reports that they exist
filtered_paths = [p for p in list_paths if Path(p).getFileSystem(conf).exists(Path(p))]
dataframe = spark.read.parquet(*filtered_paths)
Where sc is the SparkContext (available as spark.sparkContext from a SparkSession).
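Putting it together, here is a minimal end-to-end sketch, assuming a SparkSession named spark, the placeholder paths from the question, and that all paths live on the same filesystem (so the FileSystem object can be resolved once instead of per path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

list_paths = ['path1', 'path2', 'path3']  # placeholder paths from the question

conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

# resolve the FileSystem once; assumes all paths share one filesystem
fs = Path(list_paths[0]).getFileSystem(conf)
filtered_paths = [p for p in list_paths if fs.exists(Path(p))]

dataframe = spark.read.parquet(*filtered_paths)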