I'm trying to find the location of a substring within a much larger string that contains question mark wildcard characters. The large string is the output of imprecise OCR software, which inserts a question mark wherever it could tell there was a character but couldn't identify which one.
Here's an oversimplified example of what I'd like to accomplish.
Dim resultIndex As Integer = -1
Dim LargeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
Dim searchText As String = "ABC"
If searchText Like LargeOcrText Then resultIndex = LargeOcrText.IndexOf(searchText)
This should set resultIndex to 18, but it doesn't work, even if I use searchText = "*ABC*" instead. I'm almost certain there's some way to use regular expressions for the Like comparison, but I'm not very practiced with them, and even then I'm at a complete loss for how to get the index of the substring.
Edit: To be clear, I'm aware that neither Like nor IndexOf supports what I'm trying to do. That's exactly my problem. I'm looking for some other way to code it that does work.
CodePudding user response:
In your search pattern, replace every letter with [<that letter>?], a character class that matches either that letter or a literal question mark, and feed it to Regex:
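' Requires: Imports System.Text.RegularExpressions
Dim resultIndex As Integer = -1
Dim LargeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
' Each character class accepts the expected letter or a "?" left by the OCR.
Dim searchText As String = "[A?][B?][C?]"
With Regex.Match(LargeOcrText, searchText)
    ' Match.Index is the zero-based position of the match in LargeOcrText.
    If .Success Then resultIndex = .Index
End With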
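To see why the bracket form works, here is a minimal sketch (the module name CharacterClassDemo is made up for illustration). Inside [...] the question mark is an ordinary character rather than a quantifier, so each class consumes exactly one character of the OCR text.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassDemo
    Sub Main()
        ' Inside [...] the question mark is a literal character, not a quantifier,
        ' so [B?] matches exactly one character: either "B" or "?".
        Dim pattern As String = "[A?][B?][C?]"

        Console.WriteLine(Regex.IsMatch("ABC", pattern)) ' True: fully recognized text
        Console.WriteLine(Regex.IsMatch("A?C", pattern)) ' True: one unreadable character
        Console.WriteLine(Regex.IsMatch("???", pattern)) ' True: every character unreadable
        Console.WriteLine(Regex.IsMatch("AXC", pattern)) ' False: "X" is neither "B" nor "?"
    End Sub
End Module
```

Note the third case: a run of question marks as long as the search text matches any search text, so a hit in a badly recognized region may be a false positive.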
CodePudding user response:
In addition to GSerg's answer, it is possible to automatically generate the pattern [A?][B?][C?] from ABC.
Here is a working code sample.
Imports System.Diagnostics
Imports System.Linq
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim largeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
        Dim searchText As String = "ABC"
        Dim index As Integer = GetOcrIndex(largeOcrText, searchText)
        Debug.WriteLine($"Index = {index}")
    End Sub

    ' Returns the zero-based index of the first match, or -1 if there is none.
    Private Function GetOcrIndex(haystack As String, needle As String) As Integer
        Dim pattern As String = BuildPattern(needle)
        Debug.WriteLine($"Pattern = {pattern}")
        ' IgnoreCase makes the search case-insensitive; drop it for exact-case matching.
        Dim match As Match = Regex.Match(haystack, pattern, RegexOptions.IgnoreCase)
        Return If(match.Success, match.Index, -1)
    End Function

    ' Turns "ABC" into "[A?][B?][C?]".
    Private Function BuildPattern(needle As String) As String
        Return String.Concat(needle.SelectMany(AddressOf AddWildcard))
    End Function

    ' Wraps one character in a class that also accepts the OCR placeholder "?".
    Private Function AddWildcard(c As Char) As String
        Return $"[{Regex.Escape(c)}?]"
    End Function
End Module
Output:
Pattern = [A?][B?][C?]
Index = 18
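For comparison, the same search can be done without Regex by scanning the text directly. The following is only a sketch under assumptions: IndexOfWithUnknowns and OcrScanSketch are hypothetical names, not part of either answer, and unlike the sample above the comparison is case-sensitive.

```vb
Imports System

Module OcrScanSketch
    ' Hypothetical helper: returns the index of the first position where needle
    ' fits, treating "?" in haystack as "any character", or -1 if it never fits.
    Function IndexOfWithUnknowns(haystack As String, needle As String) As Integer
        If needle.Length = 0 Then Return 0
        ' If needle is longer than haystack, the loop body never runs.
        For start As Integer = 0 To haystack.Length - needle.Length
            Dim fits As Boolean = True
            For offset As Integer = 0 To needle.Length - 1
                Dim found As Char = haystack(start + offset)
                ' A mismatch is a readable character that differs from the needle.
                If found <> "?"c AndAlso found <> needle(offset) Then
                    fits = False
                    Exit For
                End If
            Next
            If fits Then Return start
        Next
        Return -1
    End Function

    Sub Main()
        Dim largeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
        Console.WriteLine(IndexOfWithUnknowns(largeOcrText, "ABC")) ' prints 18
    End Sub
End Module
```

Because no pattern is built, this version does not depend on how individual characters of the search text need to be escaped inside a character class.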