How to use regex to parse the table name from a file name in a PySpark Databricks notebook

I'm trying to extract the table name from the names of Parquet files using a regex. I'm using the following code, but the ctSchema DataFrame never seems to be built, so the job returns 0 results.

    ci= spark.createDataFrame(data=[("","","")], schema=ciSchema)
    files=dbutils.fs.ls('a filepath goes here')
    
    results = {}
    is_error = False
    
    for fi in files:
          try:
            dataFile = spark.read.parquet(fi.path)
            ctSchema = spark.createDataFrame(data = dataFile.dtypes, schema = tSchema).withColumn("TableName", regexp_extract(input_file_name(),"([a-zA-Z0-9]+_[a-zA-Z0-9]+)_shard_\d+_of_\d+\.parquet",1), lit(fi.name))
            ci = ci.union(ctSchema)
          except Exception as e:
            results[fi.name] = f"Error: {e}"
            is_error = True

CodePudding user response：

Your regex ([a-zA-Z0-9]+_[a-zA-Z0-9]+)_shard_\d+_of_\d+\.parquet is incorrect. Try this one instead: [a-zA-Z0-9]+_([a-zA-Z0-9]+)_page_\d+_of_\d+\.parquet

First, I used page_ instead of shard_ so that the pattern matches your file name.

Second, you don't want to capture ([a-zA-Z0-9]+_[a-zA-Z0-9]+), which would match TCP_119Customer. You only want the part after the underscore, so moving the parentheses to get [a-zA-Z0-9]+_([a-zA-Z0-9]+) fixes the issue.
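To check the pattern quickly outside Spark, here is a minimal sketch using Python's built-in re module. The file names are hypothetical, built from the TCP_119Customer prefix mentioned above, and the helper extract_table_name is only for illustration. Spark's regexp_extract uses Java regex, but the constructs in this pattern behave the same way in both engines.

```python
import re

# Corrected pattern: only the part after the first underscore is captured.
TABLE_NAME_PATTERN = re.compile(
    r"[a-zA-Z0-9]+_([a-zA-Z0-9]+)_page_\d+_of_\d+\.parquet"
)


def extract_table_name(file_name: str) -> str:
    """Return capture group 1, or "" when the name does not match.

    The empty-string fallback mirrors what regexp_extract returns
    when the pattern finds no match.
    """
    match = TABLE_NAME_PATTERN.search(file_name)
    return match.group(1) if match else ""


# Hypothetical file names, for illustration only.
print(extract_table_name("TCP_119Customer_page_1_of_3.parquet"))   # 119Customer
print(extract_table_name("TCP_119Customer_shard_1_of_3.parquet"))  # empty string
```

If the test passes here, the same pattern string can be passed to regexp_extract with group index 1.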