Data:
--- -------------------------------------------------------------
|id |filters |
--- -------------------------------------------------------------
|1 |{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}|
--- -------------------------------------------------------------
Code to generate data:
columns = ["id","filters"]
data = [(1, '{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}')]
rdd = sc.parallelize(data)
I have two regex written to extract and store the key and values of this string.
- Keys:
\.([a-z_] )\":'
- Values:
:\[([^:] )\
Now, when I run the following code, I see that my values regex does not return the intended results.
dfFromRDD2.withColumn("filter_category", expr(f"regexp_extract_all(filters, '\.([a-z_] )\":', 1)"))\
.withColumn("filter_inputs", expr(f"regexp_extract_all(filters, ':\[([^:] )\]', 0)")).show(truncate = False)
--- ------------------------------------------------------------- --------------- -------------
|id |filters |filter_category|filter_inputs|
--- ------------------------------------------------------------- --------------- -------------
|1 |{"option.p.one":["A","B","C","D"], "option.p.type_two":["1"]}|[one, type_two]|[:[, :[] |
--- ------------------------------------------------------------- --------------- -------------
Both regex work fine in Presto SQL
CodePudding user response:
You need to escape special characters in the python string eg.
dfFromRDD2.withColumn("filter_category", expr(f"regexp_extract_all(filters, '\\\.([a-z_] )\\\":', 1)"))\
.withColumn("filter_inputs", expr(f"regexp_extract_all(filters, ':\\\[([^:] )\\\]', 0)")).show(truncate = False)
Let me know if the above works for you.