I want to ignore the paths that generate the error
'Path does not exist'
when I read parquet files with PySpark. For example, I have a list of paths:
list_paths = ['path1', 'path2', 'path3']
and read the files like:
dataframe = spark.read.parquet(*list_paths)
but the path path2 does not exist. In general, I do not know in advance which paths do not exist, so I want to skip path2 automatically. How can I do this and obtain a single dataframe?
CodePudding user response:
Maybe you can do:

import os

# keep only the paths that exist on the local filesystem
existing_paths = [path for path in list_paths if os.path.exists(path)]
dataframe = spark.read.parquet(*existing_paths)
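Note that os.path.exists only checks the local filesystem, so this works when the parquet files are on local disk but not for paths on HDFS or S3.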
CodePudding user response:
You can use the Hadoop FS API to check that the files exist before you pass them to spark.read:
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

# keep only the paths whose filesystem reports that they exist
filtered_paths = [p for p in list_paths if Path(p).getFileSystem(conf).exists(Path(p))]
dataframe = spark.read.parquet(*filtered_paths)
Where sc is the SparkContext (available as spark.sparkContext from a SparkSession).
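Putting it together, here is a minimal end-to-end sketch, assuming a SparkSession named spark, the placeholder paths from the question, and that all paths live on the same filesystem (so the FileSystem object can be resolved once instead of per path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

list_paths = ['path1', 'path2', 'path3']  # placeholder paths from the question

conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

# resolve the FileSystem once; assumes all paths share one filesystem
fs = Path(list_paths[0]).getFileSystem(conf)
filtered_paths = [p for p in list_paths if fs.exists(Path(p))]

dataframe = spark.read.parquet(*filtered_paths)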