Regex in Spark SQL is resulting in wrong value but working fine on Amazon Athena-CodePudding

I am trying to extract text that exists inside root level brackets from a string in Spark-SQL. I have used the function regexp_extract() on both Spark-SQL and Athena on the same string with the same regex.

On Athena, it's working fine.

But on Spark-SQL, it is not returning the value as expected.

Query is:

SELECT regexp_extract('Russia (Federal Service of Healthcare)', '. \s\((. )\)', 1) AS cl

Output On Athena:

Federal Service of Healthcare

Output on Spark-SQL:

ia (Federal Service of Healthcare)

I am bumping my head around but can't seem to find a solution around this.

CodePudding user response：

This does the trick:

SELECT regexp_extract('Russia (Federal Service of Healthcare)', '. \\\\s\\\\((. )\\\\)', 1) AS cl

output:

 ----------------------------- 
|cl                           |
 ----------------------------- 
|Federal Service of Healthcare|
 -----------------------------

The s is not being escaped in your example, that's why it falls as part of the group; you can also use the regexp_extract API directly which makes a cleaner solution:

.withColumn("cl", regexp_extract(col("name"), ". \\s\\((. )\\)", 1))

Good luck!