Scala regex on a whole column

Time:12-20

I have the following pattern that I can parse using pandas in Python, but I am struggling to translate the code into Scala.

grade string_column    
 85   (str:ann smith,14)(str:frank chase,15)
 86   (str:john foo,15)(str:al more,14)

In Python I used:

df.set_index('grade')['string_column']\
.str.extractall(r'\((str:[^,]+),(\d+)\)')\
.droplevel(1)

with the output:

 grade       0                1
 85      str:ann smith       14
 85      str:frank chase     15
 86      str:john foo        15
 86      str:al more         14

In Scala I tried to duplicate the approach, but it's failing:

import scala.util.matching.Regex

val pattern = new Regex("((str:[^,]+),(\d+)\)")
val str = "(str:ann smith,14)(str:frank chase,15)"

println(pattern findAllIn(str)).mkString(","))

CodePudding user response:

A few notes about the code:

  • The first opening parenthesis is read as the start of a group that is never closed; it should be escaped so that it matches a literal (
  • In a regular (single-quoted) string literal, the backslashes must be doubled
  • In the println you don't need all the parentheses and the dot
  • findAllIn returns a MatchIterator, and iterating over it yields each whole matched string. Joining those matched strings with a comma gives back, in this case, the input string with a comma between the two entries.

For example

import scala.util.matching.Regex

val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"

println(pattern findAllIn str mkString ",")

Output

(str:ann smith,14),(str:frank chase,15)

But if you want to print the group 1 and group 2 values, you can use findAllMatchIn, which returns an iterator of Regex.Match objects:

import scala.util.matching.Regex

val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
pattern findAllMatchIn str foreach(m => {
    println(m.group(1))
    println(m.group(2))
  }
)

Output

str:ann smith
14
str:frank chase
15
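
If the values are needed for further processing rather than printing, the same iterator can be mapped into a collection. This is a minimal sketch: the Person case class is a hypothetical name introduced here for illustration, and converting group 2 to an Int is an assumption about how the number will be used.

```scala
import scala.util.matching.Regex

// Hypothetical container for one parsed entry; not part of the original question.
final case class Person(name: String, age: Int)

val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"

// findAllMatchIn yields one Regex.Match per entry; map each match to a Person.
val people: List[Person] =
  pattern.findAllMatchIn(str)
    .map(m => Person(m.group(1), m.group(2).toInt))
    .toList

people.foreach(println)
```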

CodePudding user response:

In Python, Series.str.extractall returns only the captured substrings. In Scala, findAllIn returns a MatchIterator that yields the whole matched strings. To reach the captures, call its matchData method, which yields Regex.Match objects that each expose a subgroups list.

So, to get the captures only in Scala, you need to use

val pattern = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"
(pattern findAllIn str).matchData foreach {
    m => println(m.subgroups.mkString(","))
}

Output:

str:ann smith,14
str:frank chase,15

Here, m.subgroups accesses all subgroups (captures) of each match (m).

Also, note you do not need to double backslashes in triple-quoted string literals. \((str:[^,()]+),(\d+)\) matches

  • \( - a ( char
  • (str:[^,()]+) - Group 1: str: and one or more chars other than ,, ( and )
  • , - a comma
  • (\d+) - Group 2: one or more digits
  • \) - a ) char.
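
As a quick check of the escaping rule and the group layout described above, the sketch below writes the pattern both ways and then uses the regex as an extractor on a single entry. The value names (escaped, raw, described) are chosen here for illustration only.

```scala
// The same pattern written two ways: a regular literal needs doubled
// backslashes, a triple-quoted literal takes them as written.
val escaped = "\\((str:[^,()]+),(\\d+)\\)"
val raw     = """\((str:[^,()]+),(\d+)\)"""

// Both literals hold exactly the same characters.
assert(escaped == raw)

val pattern = raw.r

// Used as an extractor, a Regex must match the whole input string and
// binds one variable per capturing group.
val described = "(str:ann smith,14)" match {
  case pattern(name, age) => s"$name -> $age"
  case _                  => "no match"
}

println(described)
```

The extractor form only fits when the input is a single entry; for a string holding several entries, findAllIn or findAllMatchIn is still needed.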

If you just want to get all matches without captures, you can use

val pattern = """\((str:[^,()]+),(\d+)\)""".r
println((pattern findAllIn str).matchData.mkString(","))

Output:

(str:ann smith,14),(str:frank chase,15)
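
To carry this over to the whole column from the question, the same extraction can be applied row by row while keeping the grade next to each match. This is only a sketch under an assumption: the table is represented as a plain in-memory Seq of (grade, string) pairs named rows, since the question does not say which data structure or framework holds the data.

```scala
// Assumed in-memory stand-in for the two-column table from the question.
val rows: Seq[(Int, String)] = Seq(
  85 -> "(str:ann smith,14)(str:frank chase,15)",
  86 -> "(str:john foo,15)(str:al more,14)"
)

val pattern = """\((str:[^,()]+),(\d+)\)""".r

// One output row per match, with the grade repeated for each match,
// similar in shape to the extractall(...).droplevel(1) result.
val extracted: Seq[(Int, String, String)] = for {
  (grade, text) <- rows
  m             <- pattern.findAllMatchIn(text).toSeq
} yield (grade, m.group(1), m.group(2))

extracted.foreach { case (grade, name, number) =>
  println(s"$grade\t$name\t$number")
}
```

Each input row contributes as many output rows as it has matches, so a row with no matches simply disappears from the result.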
