Pyspark: isDeltaTable running forever

Time: 02-11

I want to check whether a Delta table in an S3 bucket is actually a Delta table. I am trying to do this with:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder\
                        .appName('test')\
                        .getOrCreate()

    if DeltaTable.isDeltaTable(spark, "s3a://landing-zone/table_name/year=2022/month=2/part-0000-xyz.snappy.parquet"):
      print("bla")
    else:
      print("blabla")

This code runs forever without returning any result. I tested it with a local Delta table, and there it works. When I trim the path URL so it stops after the actual table name, the code shows the same behavior. I also created a boto3 client, and I can see the bucket list when calling s3.list_buckets(). Do I need to pass the client into the if statement somehow?

Thanks a lot in advance!

CodePudding user response:

I am an idiot. I forgot that it is not enough to just create a boto3 client; I also have to configure Spark's actual connection to S3 via

    spark._jsc.hadoopConfiguration().set(...)
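As a minimal sketch of what those `set(...)` calls typically look like: the keys below are the standard S3A configuration names from the `hadoop-aws` module, and reading the credentials from `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables is an assumption for illustration — adapt to however your credentials are supplied.

```python
import os

# Standard S3A configuration keys (hadoop-aws module). The credentials
# provider entry makes Hadoop use the plain key/secret pair set above.
s3a_conf = {
    "fs.s3a.access.key": os.environ.get("AWS_ACCESS_KEY_ID", ""),
    "fs.s3a.secret.key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
    "fs.s3a.aws.credentials.provider":
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
}

# With an active SparkSession, apply each entry to the Hadoop
# configuration before calling DeltaTable.isDeltaTable, e.g.:
#
# for key, value in s3a_conf.items():
#     spark._jsc.hadoopConfiguration().set(key, value)
```

Without these settings, the `s3a://` filesystem has no credentials and Spark can hang retrying the request, which matches the "runs forever" symptom.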