Extract all Capitalized words from document using PowerShell

Time:12-08

I'm using PowerShell to extract all capitalized words from a Word document. Everything works until the last line of code, as far as I can tell. Is something wrong with my regex, or is my approach all wrong?

#Extract content of Microsoft Word Document to text
$word = New-Object -comobject Word.Application
$word.Visible = $True 
$doc = $word.Documents.Open("D:\Deleteme\test.docx") 
$sel = $word.Selection
$paras = $doc.Paragraphs

$path = "D:\deleteme\words.txt"

foreach ($para in $paras) 
{ 
    $para.Range.Text | Out-File -FilePath $path -Append
}

#Find all capitalized words :( Everything works except this. I want to extract all Capitalized words
$capwords = Get-Content $path | Select-string -pattern "/\b[A-Z]+\b/g" 

CodePudding user response:

I modified your script and was able to get all the upper-case words in my test doc.

$word = New-Object -comobject Word.Application
$word.Visible = $True 
$doc = $word.Documents.Open("D:\WordTest\test.docx") 
$sel = $word.Selection
$paras = $doc.Paragraphs

$path = "D:\WordTest\words.txt"

foreach ($para in $paras) 
{ 
    $para.Range.Text | Out-File -FilePath $path -Append
}

# Get all words in the content
$AllWords = (Get-Content $path)

# Split all words into an array
$WordArray = ($AllWords).split(' ')

# Create array for capitalized words to capture them during ForEach loop
$CapWords = @()

# ForEach loop for each word in the array
foreach($SingleWord in $WordArray){

    # Perform a check to see if the word is fully capitalized
    $Check = $SingleWord -cmatch '\b[A-Z]+\b'
    
    # If check is true, remove special characters and put it into the $CapWords array
    if($Check -eq $True){
        $SingleWord = $SingleWord -replace '[\W]', ''
        $CapWords += $SingleWord
    }
}

I had it come out as an array of capitalized words, but you could always join it back if you wanted it to be a string:

$CapString = $CapWords -join " "
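
For comparison, here is a minimal sketch of the same split-and-filter idea written as a pipeline. It runs on a made-up sample string (`$text` is an assumption, not content from the Word file), so it can be tried without Word installed:

```powershell
# Hypothetical sample text standing in for the file content.
$text = 'Please send the PDF to HR, ASAP!'

# Split on spaces, keep tokens containing an all-uppercase word
# (-cmatch is the case-sensitive variant of -match), then strip
# punctuation such as the trailing "," and "!".
$capWords = @(
    $text.Split(' ') |
        Where-Object { $_ -cmatch '\b[A-Z]+\b' } |
        ForEach-Object { $_ -replace '\W', '' }
)

# Join the array back into a single string.
$capString = $capWords -join ' '
$capString   # PDF HR ASAP
```

Wrapping the pipeline in @(...) keeps $capWords an array even when only one word matches.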

CodePudding user response:

  • PowerShell uses strings to store regexes and has no syntax for regex literals such as /.../ - nor for post-positional matching options such as g.

  • PowerShell is case-insensitive by default and requires opt-in for case-sensitivity (-CaseSensitive in the case of Select-String).

    • Without that, [A-Z] is effectively the same as [A-Za-z] and therefore matches both upper- and lowercase (English) letters.
  • The equivalent of the g option is Select-String's -AllMatches switch, which looks for all matches on each input line (by default, it only looks for the first).

  • What Select-String outputs aren't strings, i.e. not the matching lines directly, but wrapper objects of type [Microsoft.PowerShell.Commands.MatchInfo] with metadata about each match.

    • Instances of that type have a .Matches property that contains an array of [System.Text.RegularExpressions.Match] instances, whose .Value property contains the text of each match (whereas the .Line property contains the matching line in full).
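
The points above can be checked with a small sketch. The sample string and variable names here are made up for illustration:

```powershell
# Hypothetical one-line input.
$sample = 'The NASA and ESA teams met in Paris'

# Default (case-insensitive): [A-Z]+ matches every word.
$insensitive = ($sample | Select-String -AllMatches -Pattern '\b[A-Z]+\b').Matches.Value

# -CaseSensitive: only the all-uppercase words match.
$sensitive = ($sample | Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b').Matches.Value

# Without -AllMatches, only the first match on the line is reported.
$firstOnly = ($sample | Select-String -CaseSensitive -Pattern '\b[A-Z]+\b').Matches.Value

$insensitive.Count   # 8
$sensitive           # NASA, ESA
$firstOnly           # NASA
```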

To put it all together:

$capwords = Get-Content -Raw $path |
  Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b' |
    ForEach-Object { $_.Matches.Value }

Note the use of -Raw with Get-Content, which greatly speeds up processing, because the entire file content is read as a single, multi-line string - essentially, Select-String then sees the entire content as a single "line". This optimization is possible, because you're not interested in line-by-line processing and only care about what the regex captured, across all lines.
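
To see what -Raw changes, here is a small sketch that writes a made-up two-line temporary file (the file and its content are assumptions for illustration) and compares both ways of reading it:

```powershell
# Create a throwaway two-line file.
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value 'FIRST line', 'second LINE'

# Without -Raw: one string per line, so Select-String emits one
# MatchInfo object per matching line.
$perLine = @(Get-Content $tmp |
    Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b')

# With -Raw: a single multi-line string, so Select-String emits one
# MatchInfo object that holds all matches.
$whole = @(Get-Content -Raw $tmp |
    Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b')

$perLine.Count         # 2
$whole.Count           # 1
$whole.Matches.Value   # FIRST, LINE

Remove-Item $tmp
```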

As an aside:

$_.Matches.Value takes advantage of PowerShell's member enumeration, which you can similarly leverage to avoid having to loop over the paragraphs in $paras explicitly:

# Use member enumeration on collection $paras to get the .Range
# property values of all collection elements and access their .Text
# property value.
$paras.Range.Text | Out-File -FilePath $path
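
Member enumeration can be tried without Word. The sketch below uses hypothetical stand-in objects ($fakeParas) that only mimic the .Range.Text shape of Word paragraphs:

```powershell
# Stand-ins for Word paragraph objects; not the real COM types.
$fakeParas = @(
    [pscustomobject]@{ Range = [pscustomobject]@{ Text = 'First paragraph' } }
    [pscustomobject]@{ Range = [pscustomobject]@{ Text = 'Second paragraph' } }
)

# Accessing .Range on the array returns the .Range value of every
# element; .Text is then enumerated across those values in turn.
$texts = $fakeParas.Range.Text

$texts.Count   # 2
$texts[1]      # Second paragraph
```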

.NET API alternative:

The [regex]::Matches() .NET method allows for a more concise - and better-performing - alternative:

$capwords = [regex]::Matches((Get-Content -Raw $path), '\b[A-Z]+\b').Value

Note that, in contrast with PowerShell, the .NET regex APIs are case-sensitive by default, so no opt-in is required.

.Value again utilizes member enumeration in order to extract the matching text from all returned match-information objects.
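
A quick sketch of the case-sensitivity difference, using a made-up sample string. The third argument is the optional RegexOptions value, passed here as a string that PowerShell converts to the enum:

```powershell
# Hypothetical sample text.
$text = 'Send the PDF to HR by Friday'

# Case-sensitive by default: only all-uppercase words match.
$upper = [regex]::Matches($text, '\b[A-Z]+\b').Value

# Opting in to case-insensitivity makes every word match.
$any = [regex]::Matches($text, '\b[A-Z]+\b', 'IgnoreCase').Value

$upper       # PDF, HR
$any.Count   # 7
```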
