Regular expressions in Databricks Spark SQL


Curious thing: using Databricks, I'm getting two different behaviors from the same regexp_replace call.

%sql
select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result

returns what I expect: TEST (PA)

Yet:

%python
display(spark.sql("""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""))

returns: TEST (PA) (1234)

Shouldn't these be equivalent in the two contexts (Spark SQL, and Spark/Python/PySpark)? Why doesn't the second context work?
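A quick way to see the difference is to print the query string before submitting it (a minimal diagnostic sketch, assuming the same notebook session):

%python
q = """select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""
print(q)
# prints: select upper(regexp_replace('Test (PA) (1234) ', '[(]\d+[)]', '')) as result
# Python has already collapsed \\ into \, so Spark receives a different
# statement than the one typed in the %sql cell.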

CodePudding user response:

You aren't taking character-escaping rules into account. There are two layers of unescaping here. First, Python converts \\ in the triple-quoted string into a single \, so the SQL statement Spark receives contains [(]\d+[)]. Then Spark SQL's string parser drops the backslash from the unrecognized escape \d, so the regex engine compiles [(]d+[)], which matches nothing. You have a choice:

  • use a raw string: r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""

  • double the backslashes: [(]\\\\d+[)]

With either of these changes, it produces the correct result.
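For concreteness, here is a minimal sketch of both fixes, assuming a Databricks notebook where spark and display are available:

%python
# Option 1: raw string. Python leaves the backslashes alone, and Spark SQL's
# string parser turns \\ into \, so the regex engine sees [(]\d+[)].
display(spark.sql(r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""))

# Option 2: doubled backslashes. Python turns \\\\ into \\, and Spark SQL's
# string parser turns \\ into \, again yielding [(]\d+[)] for the regex engine.
display(spark.sql("""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\\\d+[)]', '')) as result"""))

# Both display: TEST (PA)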
