I have the following pattern that I could parse using pandas in Python, but struggle with translating the code into Scala.
grade string_column
85 (str:ann smith,14)(str:frank chase,15)
86 (str:john foo,15)(str:al more,14)
In Python I used:
df.set_index('grade')['string_column']\
.str.extractall(r'\((str:[^,]+),(\d+)\)')\
.droplevel(1)
with the output:
grade 0 1
85 str:ann smith 14
85 str:frank chase 15
86 str:john foo 15
86 str:al more 14
In Scala I tried to duplicate the approach, but it's failing:
import scala.util.matching.Regex
val pattern = new Regex("((str:[^,]+),(\d+)\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn(str)).mkString(","))
CodePudding user response:
There are a few notes about the code:
- The first opening parenthesis is unescaped, so it opens a group that is never closed. It should be escaped to match a literal (.
- In a regular double-quoted string, the backslashes have to be doubled (for example \\d and \\().
- In the println call, you don't need all the parentheses and the dot. findAllIn returns a MatchIterator, and iterating over it yields each matched string. Joining those matched strings with a comma will, in this case, give back the original string with a comma between the two matches.
For example
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn str mkString ",")
Output
(str:ann smith,14),(str:frank chase,15)
But if you want to print the group 1 and group 2 values, you can use findAllMatchIn, which returns an iterator of Regex.Match objects:
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
pattern findAllMatchIn str foreach(m => {
println(m.group(1))
println(m.group(2))
}
)
Output
str:ann smith
14
str:frank chase
15
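To get closer to the pandas output, where each extracted pair stays attached to its grade, the same findAllMatchIn call can be applied row by row. The following is only a minimal sketch: it assumes the table is available as a plain in-memory Seq of (grade, string_column) pairs, and the names rows and extracted are made up for illustration. Data held in a Spark DataFrame would need a different approach.

```scala
import scala.util.matching.Regex

// Hypothetical in-memory stand-in for the two-column table in the question.
val rows: Seq[(Int, String)] = Seq(
  85 -> "(str:ann smith,14)(str:frank chase,15)",
  86 -> "(str:john foo,15)(str:al more,14)"
)

val pattern: Regex = new Regex("\\((str:[^,]+),(\\d+)\\)")

// One output row per match, keeping the grade next to both captured groups.
val extracted: Seq[(Int, String, String)] = rows.flatMap { case (grade, text) =>
  pattern.findAllMatchIn(text).map(m => (grade, m.group(1), m.group(2)))
}

extracted.foreach { case (grade, name, number) => println(s"$grade $name $number") }
```

This prints one line per match, in the same order as the four rows of the pandas result.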
CodePudding user response:
In Python, Series.str.extractall returns only the captured substrings. In Scala, findAllIn returns a MatchIterator that yields the whole matched strings unless you call its matchData method, which yields Match objects. Each Match in turn exposes a subgroups list.
So, to get only the captures in Scala, you can use:
val pattern = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"
(pattern findAllIn str).matchData foreach {
m => println(m.subgroups.mkString(","))
}
Output:
str:ann smith,14
str:frank chase,15
Here, m.subgroups accesses all subgroups (captures) of each match (m).
Also, note that you do not need to double the backslashes in triple-quoted string literals. The pattern \((str:[^,()]+),(\d+)\) matches:
- \( - a ( char
- (str:[^,()]+) - Group 1: str: and one or more chars other than , ( and )
- , - a comma
- (\d+) - Group 2: one or more digits
- \) - a ) char.
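A quick way to check the escaping note above is to compare the two spellings directly. This small sketch (the names escaped and raw are only for illustration) builds the pattern once from a regular string literal with doubled backslashes and once from a triple-quoted literal, then compares the regex source text of both:

```scala
import scala.util.matching.Regex

// Regular string literal: every regex backslash has to be doubled.
val escaped: Regex = new Regex("\\((str:[^,()]+),(\\d+)\\)")
// Triple-quoted literal: backslashes are taken as written.
val raw: Regex = """\((str:[^,()]+),(\d+)\)""".r

// Compare the pattern text held by the underlying java.util.regex.Pattern.
val samePattern: Boolean = escaped.pattern.pattern() == raw.pattern.pattern()
println(samePattern) // prints true
```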
If you just want to get all matches without captures, you can use
val pattern = """\((str:[^,]+),(\d+)\)""".r
println((pattern findAllIn str).matchData.mkString(","))
Output:
(str:ann smith,14),(str:frank chase,15)
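If the captures should end up as typed values rather than printed text, a Regex can also be used as an extractor in a pattern match. This is a sketch and not part of the answers above; the Person case class and its field names are assumptions made for the example.

```scala
import scala.util.matching.Regex

// Hypothetical container for one extracted pair.
case class Person(name: String, number: Int)

val pattern: Regex = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"

// findAllIn yields each full match as a String; matching that String against
// the same regex binds group 1 to name and group 2 to number.
val people: List[Person] = pattern.findAllIn(str).collect {
  case pattern(name, number) => Person(name, number.toInt)
}.toList

people.foreach(println)
```

The extractor form only succeeds when the regex matches the whole input string, which holds here because each element produced by findAllIn is exactly one full match.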