Home > other >  Regex that works on Athena does not work in spark-sql
Regex that works on Athena does not work in spark-sql

Time:07-27

I have a regex which works with Athena but not in spark-sql. What should I change in spark-sql so I will get same output as in Athena?

Sample Input: 'ASEIAW,1245555,asda2dd,TPOIBV'
Expected output: ['ASEIAW,TPOIBV'] - An array with all matching elements which is an alphabet in upper case with exactly 6 charterers.

Athena execution:

SELECT REGEXP_EXTRACT_ALL('ASEIAW,1245555,asda2dd,TPOIBV' , '(\b[A-Z]{6}\b) '), 'ASEIAW,1245555,asda2dd,TPOIBV' as INPUT;
    
_col0              |INPUT                        |
------------------- ----------------------------- 
['ASEIAW','TPOIBV']|ASEIAW,1245555,asda2dd,TPOIBV|

SPARK SQL execution:

spark.sql("""SELECT REGEXP_EXTRACT_ALL('ASEIAW,1245555,asda2dd,TPOIBV' , '(\b[A-Z]{6}\b) ') as regex_out, 'ASEIAW,1245555,asda2dd,TPOIBV' as INPUT""").show()
 --------- -------------------- 
|regex_out|               INPUT|
 --------- -------------------- 
|       []|ASEIAW,1245555,as...|
 --------- -------------------- 

I want to see the same output in spark which I see in Athena. What should I change?

CodePudding user response:

Try escaping the \ characters:

spark.sql("""SELECT REGEXP_EXTRACT_ALL('ASEIAW,1245555,asda2dd,TPOIBV' , '(\\\\b[A-Z]{6}\\\\b) ') as regex_out, 'ASEIAW,1245555,asda2dd,TPOIBV' as INPUT""").show()

CodePudding user response:

The following seems to work. You needed more \

spark.sql("""SELECT REGEXP_EXTRACT_ALL('ASEIAW,1245555,asda2dd,TPOIBV' , '(\\\\b[A-Z]{6}\\\\b) ') as regex_out, 'ASEIAW,1245555,asda2dd,TPOIBV' as INPUT""").show()
  • Related