Curious thing. I'm using Databricks, but I'm getting two different behaviors from the same regexp_replace call depending on the context.
%sql
select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result
returns what I expect: TEST (PA)
Yet:
%python
display(spark.sql("""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""))
returns: TEST (PA) (1234)
Shouldn't these be equivalent in the two contexts (Spark SQL vs. Python/PySpark)? Why doesn't the second one work?
CodePudding user response:
You aren't taking character escaping rules into account. In Python, the \\ in a regular (non-raw) string is converted into a single \, so the SQL statement Spark receives contains [(]\d+[)]. Spark SQL then applies its own escaping to the string literal and strips the remaining backslash, so the regex engine actually sees [(]d+[)], which matches nothing.
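You can see the two layers of escaping with plain Python, no Spark needed:
%python
# A regular string collapses \\ into one backslash; a raw string keeps both.
print("\\d")    # prints \d
print(r"\\d")   # prints \\d
print("\\\\d")  # prints \\d  (doubling in a regular string has the same effect)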
You have two options:
use a raw string, so Python passes the backslashes through to Spark untouched:
r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""
or double the backslashes in a regular string:
[(]\\\\d+[)]
With either change, it produces the correct result; a runnable sketch of both follows.
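For completeness, a minimal sketch of both fixes side by side, assuming a notebook session where spark is already defined (as it is on Databricks):
%python
# Fix 1: raw string. Python keeps \\d intact; Spark SQL then unescapes \\ to \,
# so the regex engine sees \d and the digit group is removed.
display(spark.sql(r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""))

# Fix 2: doubled backslashes in a regular string. Python turns \\\\ into \\,
# and Spark SQL turns \\ into \, with the same end result.
display(spark.sql("""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\\\d+[)]', '')) as result"""))

# Both display: TEST (PA)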