First, I'm trying to understand this. Second, I would like to use it.
# test string
$pgNumString = 'C:\test\test5\AALTONEN-ALLAN_PENCARROW_PAGE_1.txt'
# Regex with capture group for number '1' ONLY from $pgNumString
# In other use cases it may be page 10 or any page in 100s
$pgNumRegex = "(?s)_(\d+)\."
# Simplest - not using -SimpleMatch because this example uses regex (Select-String docs)
$pgNum = $pgNumString | Select-String -Pattern $pgNumRegex -AllMatches
The match is not assigned to $pgNum. No capture grouping means no good anyway. A slightly more sophisticated attempt:
$pgNum = $pgNumString | Select-String -Pattern $pgNumRegex -AllMatches | Select-Object {$_.Matches.Groups[1].Value}
Output:
$_.Matches.Groups[1].Value
--------------------------
1
The match is still not assigned to $pgNum. But the output shows I'm on the right track. What am I doing wrong?
CodePudding user response:
Especially if you're dealing with strings already in memory, but often also with files (except if they're exceptionally large), use of Select-String isn't necessary and both slows down and complicates the solution, as your example shows.
While the -match operator works in principle too, and can likewise focus on matching only what should be extracted, it is limited to one match, whose results are reflected in the automatic $Matches variable.
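For the single-match case, a minimal sketch of the -match approach might look as follows. It reuses the sample path from the question; the inline pattern with one capture group is an assumption made for illustration, not part of the original answer.

```powershell
# Sample path, copied from the question.
$path = 'C:\test\test5\AALTONEN-ALLAN_PENCARROW_PAGE_1.txt'

# -match returns $true or $false for a single input string and, on success,
# populates the automatic $Matches hashtable.
$pgNum = $null
if ($path -match '_(\d+)\.') {
    # Index 0 holds the whole match ('_1.'); index 1 holds capture group 1.
    $pgNum = $Matches[1]
}

$pgNum
```

Because $Matches only ever reflects the most recent successful match, this is suitable only when each input string contains a single page number.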
However, you can make direct use of an underlying .NET API, namely [regex]::Matches().
# Sample input.
$pgNumString = @'
C:\test\test5\AALTONEN-ALLAN_PENCARROW_PAGE_1.txt
C:\test\test6\AALTONEN-ALLAN_PENCARROW_PAGE_42.txt
'@
# -> '1', '42'
# Note: To match PowerShell's case-*insensitive* behavior (not relevant here), use:
# [regex]::Matches($pgNumString, '(?<=_)\d+(?=\.)', 'IgnoreCase').Value
[regex]::Matches($pgNumString, '(?<=_)\d+(?=\.)').Value
As an aside:
- Bringing the functionality of [regex]::Matches() natively to PowerShell in the future, in the form of a -matchall operator, is the subject of GitHub issue #7867.
Note that I've modified your regex to use look-around assertions so that what it captures consists solely of the substring to extract, reflected in the .Value property.
For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
Using your original approach requires extra work to extract the capture-group values, with the help of the intrinsic .ForEach() method:
[regex]::Matches($pgNumString, '_(\d+)\.').ForEach({ $_.Groups[1].Value })
As for what you tried:
As Santiago notes, you need to use ForEach-Object instead of Select-Object, but there's an additional requirement:
Given your use of -AllMatches, you need to access .Groups[1].Value on each of the matches reported in .Matches, otherwise you'll only get the first match's capture-group value:
$pgNumString |
Select-String -Pattern $pgNumRegex -AllMatches |
ForEach-Object { $_.Matches.ForEach({ $_.Groups[1].Value }) }
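To illustrate why Select-Object appeared not to assign the match, here is a small sketch comparing the two cmdlets on the single-path input from the question. The variable names $viaSelect and $viaForEach are hypothetical, chosen only for this comparison.

```powershell
$pgNumString = 'C:\test\test5\AALTONEN-ALLAN_PENCARROW_PAGE_1.txt'
$matchInfo = $pgNumString | Select-String -Pattern '_(\d+)\.' -AllMatches

# Select-Object treats the script block as a calculated property: the result
# is a wrapper object with one property named after the script-block text,
# which is why the question's output showed a column header above the '1'.
$viaSelect = $matchInfo | Select-Object { $_.Matches.Groups[1].Value }

# ForEach-Object emits the script block's output itself, so the variable
# receives the extracted value(s) directly.
$viaForEach = $matchInfo | ForEach-Object { $_.Matches.ForEach({ $_.Groups[1].Value }) }

$viaSelect.GetType().Name    # PSCustomObject
$viaForEach.GetType().Name   # String
$viaForEach                  # 1
```

In other words, a value was assigned in the Select-Object attempt, but it was wrapped in an object rather than stored as the bare string.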
As an aside:
Making Select-String only return the matching parts of the input lines / strings, via an -OnlyMatching switch, is a green-lit future enhancement; see GitHub issue #7712. While this wouldn't directly help with capture groups, it is usually possible to reformulate regexes with look-around assertions, as shown with [regex]::Matches() above.