Regex is working in presto SQL but is not working in pyspark expr


Data:

+---+-------------------------------------------------------------+
|id |filters                                                      |
+---+-------------------------------------------------------------+
|1  |{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}|
+---+-------------------------------------------------------------+

Code to generate data:

columns = ["id", "filters"]
data = [(1, '{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}')]

rdd = sc.parallelize(data)
dfFromRDD2 = rdd.toDF(columns)

I have written two regexes to extract the keys and values from this string:

  1. Keys: \.([a-z_]+)\":
  2. Values: :\[([^:]+)\]
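As a sanity check outside Spark, both intended patterns behave correctly in plain Python's `re` module, where raw strings sidestep any extra escaping layer:

```python
import re

# The JSON-like string from the "filters" column above
s = '{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}'

# The intended patterns, as Python raw strings (no extra escaping needed)
keys_re = r'\.([a-z_]+)":'
values_re = r':\[([^:]+)\]'

print(re.findall(keys_re, s))    # ['one', 'type_two']
print(re.findall(values_re, s))  # ['"A","B","C","D"', '"1"']
```

This confirms the patterns themselves are fine; the problem is how they survive the trip through the Python f-string and Spark's SQL string parser.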

Now, when I run the following code, I see that my values regex does not return the intended results.

dfFromRDD2.withColumn("filter_category", expr(f"regexp_extract_all(filters, '\.([a-z_]+)\":', 1)"))\
.withColumn("filter_inputs", expr(f"regexp_extract_all(filters, ':\[([^:]+)\]', 0)")).show(truncate=False)


+---+-------------------------------------------------------------+---------------+-------------+
|id |filters                                                      |filter_category|filter_inputs|
+---+-------------------------------------------------------------+---------------+-------------+
|1  |{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}|[one, type_two]|[:[, :[]     |
+---+-------------------------------------------------------------+---------------+-------------+

Both regexes work fine in Presto SQL.

CodePudding user response:

You need to escape the special characters in the Python string, e.g.

dfFromRDD2.withColumn("filter_category", expr(f"regexp_extract_all(filters, '\\\.([a-z_]+)\\\":', 1)"))\
.withColumn("filter_inputs", expr(f"regexp_extract_all(filters, ':\\\[([^:]+)\\\]', 0)")).show(truncate=False)

Let me know if the above works for you.
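The extra backslashes are needed because the pattern passes through two string parsers before it reaches the regex engine: Python first, then Spark's SQL parser. A quick sketch of the Python layer (the assertion below only demonstrates Python's own escape handling, not Spark's):

```python
# The literal used inside expr(...) above, as Python sees it.
# Python collapses \\ to \ and leaves the unrecognized escape \. as-is,
# so the string handed to Spark is two characters of backslash plus a dot.
py_literal = "\\\."
assert py_literal == r"\\."  # backslash, backslash, dot
assert len(py_literal) == 3

# Spark's SQL string parser then collapses the remaining \\ to \,
# so the regex engine finally receives \. -- a literal dot, as intended.
```

With a single backslash, as in the original code, the regex engine ends up seeing an unescaped `.`, `[`, and `]`, which is why `filter_inputs` returned only the `:[` fragments.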
