I'm trying to get the table name from Parquet file names using a regex. I'm using the following code to attempt this, but the ctSchema
DataFrame doesn't seem to get built, so the job returns 0 results.
ci = spark.createDataFrame(data=[("","","")], schema=ciSchema)
files = dbutils.fs.ls('a filepath goes here')
results = {}
is_error = False
for fi in files:
    try:
        dataFile = spark.read.parquet(fi.path)
        ctSchema = spark.createDataFrame(data = dataFile.dtypes, schema = tSchema).withColumn("TableName", regexp_extract(input_file_name(),"([a-zA-Z0-9]+_[a-zA-Z0-9]+)_shard_\d+_of_\d+\.parquet",1), lit(fi.name))
        ci = ci.union(ctSchema)
    except Exception as e:
        results[fi.name] = f"Error: {e}"
        is_error = True
CodePudding user response:
Your regex ([a-zA-Z0-9]+_[a-zA-Z0-9]+)_shard_\d+_of_\d+\.parquet is incorrect. Try this one instead: [a-zA-Z0-9]+_([a-zA-Z0-9]+)_page_\d+_of_\d+\.parquet
First, I used page_ instead of shard_, which matches your file name.
Second, you don't want to group ([a-zA-Z0-9]+_[a-zA-Z0-9]+), which would match TCP_119Customer. You only want the second part, so changing the group to [a-zA-Z0-9]+_([a-zA-Z0-9]+) will fix the issue.
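To check the two patterns without a Spark cluster, here is a minimal sketch using Python's re module. The sample file name TCP_119Customer_page_1_of_5.parquet and the helper extract_table_name are assumptions made up for illustration (only the TCP_119Customer part comes from the answer above), and the helper merely imitates regexp_extract by returning an empty string when nothing matches.

```python
import re

# Pattern from the question: captures both parts and expects "_shard_".
OLD_PATTERN = r"([a-zA-Z0-9]+_[a-zA-Z0-9]+)_shard_\d+_of_\d+\.parquet"
# Pattern from the answer: captures only the second part and expects "_page_".
NEW_PATTERN = r"[a-zA-Z0-9]+_([a-zA-Z0-9]+)_page_\d+_of_\d+\.parquet"


def extract_table_name(file_name, pattern):
    """Return capture group 1, or "" when the pattern does not match.

    Hypothetical helper: the empty-string fallback imitates what
    Spark's regexp_extract does on a non-matching input.
    """
    match = re.search(pattern, file_name)
    return match.group(1) if match else ""


# Hypothetical file name, assumed for illustration only.
sample = "TCP_119Customer_page_1_of_5.parquet"

old_result = extract_table_name(sample, OLD_PATTERN)
new_result = extract_table_name(sample, NEW_PATTERN)

print(repr(old_result))  # '' because the name contains "_page_", not "_shard_"
print(repr(new_result))  # '119Customer'
```

For a simple pattern like this one, Python's re and the Java regex engine that Spark uses should behave the same way, so the check is a reasonable stand-in before rerunning the job.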