Home > Software engineering >  Regex in Spark SQL is resulting in wrong value but working fine on Amazon Athena
Regex in Spark SQL is resulting in wrong value but working fine on Amazon Athena

Time:07-23

I am trying to extract text that exists inside root level brackets from a string in Spark-SQL. I have used the function regexp_extract() on both Spark-SQL and Athena on the same string with the same regex.

On Athena, it's working fine.

But on Spark-SQL, it is not returning the value as expected.

Query is:

SELECT regexp_extract('Russia (Federal Service of Healthcare)', '. \s\((. )\)', 1) AS cl

Output On Athena:

Federal Service of Healthcare

Output on Spark-SQL:

ia (Federal Service of Healthcare)

I am bumping my head around but can't seem to find a solution around this.

CodePudding user response:

This does the trick:

SELECT regexp_extract('Russia (Federal Service of Healthcare)', '. \\\\s\\\\((. )\\\\)', 1) AS cl

output:

 ----------------------------- 
|cl                           |
 ----------------------------- 
|Federal Service of Healthcare|
 ----------------------------- 

The s is not being escaped in your example, that's why it falls as part of the group; you can also use the regexp_extract API directly which makes a cleaner solution:

.withColumn("cl", regexp_extract(col("name"), ". \\s\\((. )\\)", 1))

Good luck!

  • Related