regexp extract pyspark sql: ParseException Literals of type 'R' are currently not supporte-CodePudding

I'm using Pyspark SQL with regexp_extract in this way:

df = spark.createDataFrame([['id_20_30', 10], ['id_40_50', 30]], ['id', 'age'])
df.createOrReplaceTempView("table")
sql_statement="""
select regexp_extract(id, r'(\d )', 1) as id
from table
"""
df = spark.sql(sql_statement)

But I obtain this error:

Traceback (most recent call last):
  File "/tmp/spark-5b33a47c-6490-4f80-9a97-3acb37f683ec/script.py", line 86, in <module>
    df = spark.sql(sql_statement)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.ParseException: 
Literals of type 'R' are currently not supported.

The result should be

 --- 
| id|
 --- 
| 20|
| 40|
 ---

If I remove the r it doesn't fail but it creates a wrong result. If I try this code with Pyspark locally it works (Pyspark version 3.3.0), but when I run this code in an EMR job it fails, I'm using emr-6.6.0 as application. Do you have any suggestion? I know I could try to not use pyspark sql but I would prefer to proceed with it.

CodePudding user response：

You need to escape the '\' with '\' and you need two of them.

sql_statement="select regexp_extract(id, '([0-9] )', 1) as id from table "
#or
sql_statement="select regexp_extract(id, '(\\\\d )', 1) as id from table "
df = spark.sql(sql_statement)
df.show()
 --- 
| id|
 --- 
| 20|
| 40|
 ---