I am trying to extract text that exists inside root level brackets from a string in Spark-SQL. I have used the function regexp_extract() on both Spark-SQL and Athena on the same string with the same regex.
On Athena, it's working fine.
But on Spark-SQL, it is not returning the value as expected.
Query is:
SELECT regexp_extract('Russia (Federal Service of Healthcare)', '. \s\((. )\)', 1) AS cl
Output On Athena:
Federal Service of Healthcare
Output on Spark-SQL:
ia (Federal Service of Healthcare)
I am bumping my head around but can't seem to find a solution around this.
CodePudding user response:
This does the trick:
SELECT regexp_extract('Russia (Federal Service of Healthcare)', '. \\\\s\\\\((. )\\\\)', 1) AS cl
output:
-----------------------------
|cl |
-----------------------------
|Federal Service of Healthcare|
-----------------------------
The s
is not being escaped in your example, that's why it falls as part of the group; you can also use the regexp_extract
API directly which makes a cleaner solution:
.withColumn("cl", regexp_extract(col("name"), ". \\s\\((. )\\)", 1))
Good luck!