Regex in Scala giving unexpected additional results

Time:07-06

I want to identify columns whose names start with the characters id.

We tried ^id* and also ^(id)*, which work in RegExr but give additional, unexpected results in Scala.

RegExr:

RegExr screenshot

Scala (Databricks)

Databricks Screenshot

CodePudding user response:

The answer was ^(id), without the trailing *.

screenshot of working notebook
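
Since the notebook screenshot is not reproduced here, the following is only a sketch of why ^(id) is enough. It assumes the check looks for a match at the start of the column name rather than requiring the whole name to match; the column names and the object name are made up for illustration.

```scala
object StartsWithId {
  // Hypothetical column names, for illustration only.
  val columns: Seq[String] = Seq("id", "id_customer", "identifier", "merchant_name", "paid")

  // Anchored at the start, no quantifier: the name must begin with "id".
  // The parentheses are optional here; "^id" behaves the same.
  val startsWithId = "^(id)".r

  // findFirstIn searches for a match inside the string (partial match).
  val idColumns: Seq[String] =
    columns.filter(name => startsWithId.findFirstIn(name).isDefined)

  def main(args: Array[String]): Unit =
    println(idColumns) // List(id, id_customer, identifier)
}
```

Note that "paid" is rejected even though it contains "id", because ^ pins the match to the start of the name.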

CodePudding user response:

tl;dr

Here is a brief explanation of the problem, because solving this specific case without knowing the reason behind it doesn't mean the problem will never happen again. In a regular expression, an asterisk means any number of repetitions (including zero) of the element directly before it. In your first case that element is the d alone, not the whole id:

val firstRegex = "^id*".r
// the string must start with "i", followed by any number of consecutive "d"s (including none)
// note: matches tests the whole string, not just a prefix
firstRegex.matches("i") // true
firstRegex.matches("idddddddd") // true
firstRegex.matches("iPhone") // false, only a sequence of "d"s may follow the single "i"

Your second regex, as you might guess, accepts the string "id" repeated any number of times (including zero):

val secondRegex = "^(id)*".r

secondRegex.matches("ididididid") // true
secondRegex.matches("idi") // false
secondRegex.matches("") // true, zero "id"s
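
The matches calls above test the whole string. If the notebook code instead searches for a match anywhere in the column name (an assumption, since the original screenshot is not reproduced here), the same two patterns accept far more names, which would explain the additional results. A small sketch of the difference; the helper names fullMatch and partialMatch are made up:

```scala
import scala.util.matching.Regex

object FullVsPartial {
  // Whole-string match, the same check as firstRegex.matches(...) above.
  // Going through pattern.matcher also works on Scala versions older than 2.13.
  def fullMatch(r: Regex, s: String): Boolean = r.pattern.matcher(s).matches()

  // Partial match: true if the pattern matches somewhere in the string.
  def partialMatch(r: Regex, s: String): Boolean = r.findFirstIn(s).isDefined

  val firstRegex  = "^id*".r
  val secondRegex = "^(id)*".r

  def main(args: Array[String]): Unit = {
    println(fullMatch(firstRegex, "iPhone"))            // false
    println(partialMatch(firstRegex, "iPhone"))         // true: the leading "i" alone matches
    println(fullMatch(secondRegex, "merchant_name"))    // false
    println(partialMatch(secondRegex, "merchant_name")) // true: zero "id"s match the empty prefix
  }
}
```

With partial matching, ^id* accepts every name that starts with i, and ^(id)* accepts every name at all, because zero repetitions always match at the start of the string.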

Wildcard

In your case, you want the column name to start with the string id, no matter what follows. The dot (.) is the special character that, in almost all regular expression engines, matches any single character (by default, anything except a line break). Knowing that, you can say:

I want my column to start with "id", followed by any number (asterisk) of any character (wildcard).

So:

@ val columnNamePattern = "^id.*".r 
columnNamePattern: scala.util.matching.Regex = ^id.*

@ columnNamePattern.matches("identifier") 
res15: Boolean = true

@ columnNamePattern.matches("merchant_name") 
res16: Boolean = false
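
Applied to a list of column names, the pattern can be used as a filter. The following is a minimal sketch with hypothetical column names (in a real notebook they would come from the DataFrame); for a plain prefix test like this one, String.startsWith gives the same result without a regex.

```scala
object IdColumns {
  // Hypothetical column names, for illustration only.
  val columns: Seq[String] = Seq("id", "id_customer", "identifier", "merchant_name", "paid")

  val columnNamePattern = "^id.*".r

  // Whole-string match against ^id.* (pattern.matcher also works before Scala 2.13).
  val byRegex: Seq[String] =
    columns.filter(name => columnNamePattern.pattern.matcher(name).matches())

  // Same result for these names with a plain prefix check.
  val byPrefix: Seq[String] = columns.filter(_.startsWith("id"))

  def main(args: Array[String]): Unit = {
    println(byRegex)  // List(id, id_customer, identifier)
    println(byPrefix) // List(id, id_customer, identifier)
  }
}
```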