Home > Net >  regexp_replace fails in presto but works on regex101 with java8 flavor
regexp_replace fails in presto but works on regex101 with java8 flavor

Time:12-28

The following regex works on regex101 (with the Java8 flavor), but doesn't work when running it with regexp_replace in Presto SQL.

Regex: (?<!timeFrame:\"(?:\d\d?\d?\d?)?)[0-9.\/,]

Test string: timeFrame:"122 Days" ipAddress:1.1.1.1

Expected outcome: timeFrame:"122 Days" ipAddress:

You can see it working on regex101 here: https://regex101.com/r/udpivA/1

According to Presto's documentation, it should be running "the Java pattern syntax".

When running this function in Presto: regexp_replace(string, '(?<!timeFrame:\"(?:\d\d?\d?\d?)?)[0-9.\/,]', '')

I get this error:

java.sql.SQLException: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. INVALID_FUNCTION_ARGUMENT: invalid pattern in look-behind

How can I get this regex to work on Presto?

CodePudding user response:

The error message mentions an invalid pattern in the look-behind pattern, which most probably means that presto does not support look-behind patterns that could match strings with varying lengths, and so it would not allow the ? quantifier (nor or *) in a look-behind pattern.

You can avoid this by actually matching the "timeframe" pattern, and capturing it in a capture group, so that you can reproduce it in the replacement string with $1 -- which practically means that you don't actually delete it. With | you can then add another alternative that also matches digits (but which don't match with the first option). If those are matched, the capture group (around the "timeframe" pattern) is empty, and so $1 represents the empty string, which practically means that those matches are removed:

regexp_replace(string, '(timeFrame:\"\d )|\d[0-9.\/,]*', '$1')

NB: I have prefixed the [0-9.\/,] pattern with \d as I assume you don't want to remove points and commas when they are not part of a term that starts with a digit.

  • Related