Using PowerShell to extract all capitalized words from a document. Everything works until the last line of code, as far as I can tell. Is something wrong with my regex, or is my approach all wrong?
#Extract content of Microsoft Word Document to text
$word = New-Object -comobject Word.Application
$word.Visible = $True
$doc = $word.Documents.Open("D:\Deleteme\test.docx")
$sel = $word.Selection
$paras = $doc.Paragraphs
$path = "D:\deleteme\words.txt"
foreach ($para in $paras)
{
$para.Range.Text | Out-File -FilePath $path -Append
}
#Find all capitalized words :( Everything works except this. I want to extract all Capitalized words
$capwords = Get-Content $path | Select-String -Pattern "/\b[A-Z]+\b/g"
CodePudding user response:
I modified your script and was able to get all the upper-case words in my test doc.
$word = New-Object -comobject Word.Application
$word.Visible = $True
$doc = $word.Documents.Open("D:\WordTest\test.docx")
$sel = $word.Selection
$paras = $doc.Paragraphs
$path = "D:\WordTest\words.txt"
foreach ($para in $paras)
{
$para.Range.Text | Out-File -FilePath $path -Append
}
# Get all words in the content
$AllWords = (Get-Content $path)
# Split all words into an array
$WordArray = ($AllWords).split(' ')
# Create array for capitalized words to capture them during ForEach loop
$CapWords = @()
# ForEach loop for each word in the array
foreach ($SingleWord in $WordArray) {
    # Case-sensitive check for a run of uppercase letters bounded by word boundaries
    $Check = $SingleWord -cmatch '\b[A-Z]+\b'
    # If the check is true, remove non-word characters and append the word to $CapWords
    if ($Check -eq $True) {
        $SingleWord = $SingleWord -replace '[\W]', ''
        $CapWords += $SingleWord
    }
}
I had it come out as an array of capitalized words, but you could always join it back if you wanted it to be a string:
$CapString = $CapWords -join " "
CodePudding user response:
- PowerShell uses strings to store regexes and has no syntax for regex literals such as /.../, nor for post-positional matching options such as g.
- PowerShell is case-insensitive by default and requires opt-in for case-sensitivity (-CaseSensitive in the case of Select-String).
  - Without that, [A-Z] is effectively the same as [A-Za-z] and therefore matches both upper- and lowercase (English) letters.
- The equivalent of the g option is Select-String's -AllMatches switch, which looks for all matches on each input line (by default, it only looks for the first).
- What Select-String outputs aren't strings, i.e. not the matching lines directly, but wrapper objects of type [Microsoft.PowerShell.Commands.MatchInfo] with metadata about each match.
  - Instances of that type have a .Matches property that contains an array of [System.Text.RegularExpressions.Match] instances, whose .Value property contains the text of each match (whereas the .Line property contains the matching line in full).
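The points above can be checked on an in-memory string, without a Word document. This is a minimal sketch; the sample text and the variable names $sample, $default, $result and $allCaps are made up for illustration.

```powershell
# Hypothetical sample text, used only for illustration.
$sample = 'NASA and the FBI met Bob in Paris.'

# Default behavior: case-insensitive, and only the first match per line is reported.
$default = ($sample | Select-String -Pattern '\b[A-Z]+\b').Matches.Value

# -CaseSensitive restricts [A-Z] to uppercase letters;
# -AllMatches reports every match on the line, not just the first.
$result = $sample | Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b'

# $result is a MatchInfo wrapper object, not a string.
# .Line holds the full matching line; .Matches.Value holds the matched text.
$allCaps = $result.Matches.Value
```

Here $default holds only NASA (the first match), while $allCaps holds NASA and FBI. Bob and Paris are skipped because the pattern requires the whole word to consist of uppercase letters.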
To put it all together:
$capwords = Get-Content -Raw $path |
  Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b' |
  ForEach-Object { $_.Matches.Value }
Note the use of -Raw with Get-Content, which greatly speeds up processing, because the entire file content is read as a single, multi-line string; essentially, Select-String then sees the entire content as a single "line". This optimization is possible because you're not interested in line-by-line processing and only care about what the regex captured, across all lines.
As an aside: $_.Matches.Value takes advantage of PowerShell's member enumeration, which you can similarly leverage to avoid having to loop over the paragraphs in $paras explicitly:
# Use member enumeration on collection $paras to get the .Range
# property values of all collection elements and access their .Text
# property value.
$paras.Range.Text | Out-File -FilePath $path
.NET API alternative:
The [regex]::Matches() .NET method allows for a more concise and better-performing alternative:
$capwords = [regex]::Matches((Get-Content -Raw $path), '\b[A-Z]+\b').Value
Note that, in contrast with PowerShell, the .NET regex APIs are case-sensitive by default, so no opt-in is required.
.Value again utilizes member enumeration in order to extract the matching text from all returned match-information objects.
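As a quick sanity check, the two approaches can be compared on a multi-line string standing in for the output of Get-Content -Raw. This is only a sketch; the sample text and the names $sample, $viaDotNet and $viaCmdlet are assumptions for illustration.

```powershell
# Hypothetical two-line sample, standing in for (Get-Content -Raw $path).
$sample = "The WHO and UNICEF issued a report.`nIt cites the CDC."

# .NET API: case-sensitive by default, returns all matches across the whole string.
$viaDotNet = [regex]::Matches($sample, '\b[A-Z]+\b').Value

# Cmdlet: needs explicit opt-in for case sensitivity and for all matches.
$viaCmdlet = ($sample |
  Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b').Matches.Value
```

Both variables end up holding WHO, UNICEF and CDC, in that order.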