I have a regex
which works with Athena
but not in spark-sql
.
What should I change in spark-sql
so I will get same output as in Athena
?
Sample Input: 'ASEIAW,1245555,asda2dd,TPOIBV'
Expected output: ['ASEIAW,TPOIBV'] - An array with all matching elements which is an alphabet in upper case with exactly 6 charterers.
Athena execution:
SELECT REGEXP_EXTRACT_ALL('ASEIAW,1245555,asda2dd,TPOIBV' , '(\b[A-Z]{6}\b) '), 'ASEIAW,1245555,asda2dd,TPOIBV' as INPUT;
_col0 |INPUT |
------------------- -----------------------------
['ASEIAW','TPOIBV']|ASEIAW,1245555,asda2dd,TPOIBV|
SPARK SQL execution:
spark.sql("""SELECT REGEXP_EXTRACT_ALL('ASEIAW,1245555,asda2dd,TPOIBV' , '(\b[A-Z]{6}\b) ') as regex_out, 'ASEIAW,1245555,asda2dd,TPOIBV' as INPUT""").show()
--------- --------------------
|regex_out| INPUT|
--------- --------------------
| []|ASEIAW,1245555,as...|
--------- --------------------
I want to see the same output in spark which I see in Athena. What should I change?
CodePudding user response:
Try escaping the \
characters:
spark.sql("""SELECT REGEXP_EXTRACT_ALL('ASEIAW,1245555,asda2dd,TPOIBV' , '(\\\\b[A-Z]{6}\\\\b) ') as regex_out, 'ASEIAW,1245555,asda2dd,TPOIBV' as INPUT""").show()
CodePudding user response:
The following seems to work. You needed more \
spark.sql("""SELECT REGEXP_EXTRACT_ALL('ASEIAW,1245555,asda2dd,TPOIBV' , '(\\\\b[A-Z]{6}\\\\b) ') as regex_out, 'ASEIAW,1245555,asda2dd,TPOIBV' as INPUT""").show()