IndexOf with wildcards

Time:10-13

I'm trying to find the location of a substring within a much larger string that contains question mark wildcard characters. The large string is the output of imprecise OCR software, and it contains wildcards wherever the software could tell there was a character but couldn't identify which one.

Here's an oversimplified example of what I'd like to accomplish.

    Dim resultIndex As Integer = -1
    Dim LargeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
    Dim searchText As String = "ABC"
    If searchText Like LargeOcrText Then resultIndex = LargeOcrText.IndexOf(searchText)

This should set resultIndex to 18, but it doesn't work, even if I use searchText = "*ABC*" instead. I'm almost certain I could use regular expressions in place of the Like comparison, but I'm not very practiced with them, and even then I'm at a complete loss for how to get the index of the substring.

Edit: To be clear, I'm aware that neither Like nor IndexOf supports what I'm trying to do. That's exactly my problem. I'm looking for some other way to code it that does work.

CodePudding user response:

In your search pattern, replace every letter with the character class [<that letter>?], which matches either that letter or a literal question mark, and feed the result to Regex:

    ' Requires: Imports System.Text.RegularExpressions
    Dim resultIndex As Integer = -1
    Dim LargeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
    ' Each class matches the letter itself or a literal "?" left by the OCR.
    Dim searchText As String = "[A?][B?][C?]"

    With Regex.Match(LargeOcrText, searchText)
        If .Success Then resultIndex = .Index
    End With
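To see why this pattern works: inside a character class, ? loses its usual quantifier meaning and stands for a literal question mark, so [B?] accepts either B or the OCR placeholder. The following is a minimal sketch to illustrate that. The module name CharacterClassDemo, the helper FitsPattern, and the test strings are made up for this example.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassDemo

    ' Same pattern as in the answer: each position accepts its letter or a literal "?".
    Private Const Pattern As String = "[A?][B?][C?]"

    ' Hypothetical helper: True if the candidate contains a run that fits the pattern.
    Function FitsPattern(candidate As String) As Boolean
        Return Regex.IsMatch(candidate, Pattern)
    End Function

    Sub Main()
        Console.WriteLine(FitsPattern("ABC")) ' True: every letter was recognized
        Console.WriteLine(FitsPattern("A?C")) ' True: the middle character is a placeholder
        Console.WriteLine(FitsPattern("???")) ' True: all three are placeholders
        Console.WriteLine(FitsPattern("AXC")) ' False: X is neither B nor "?"
    End Sub

End Module
```

Note that the third case shows a side effect of this approach: a run of placeholders matches any search text of the same length.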

CodePudding user response:

In addition to GSerg's answer, it is possible to automatically generate the pattern [A?][B?][C?] from ABC.

Here is a working code sample.

    Imports System.Linq
    Imports System.Text.RegularExpressions

    Module Module1

        Sub Main()
            Dim largeOcrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
            Dim searchText As String = "ABC"
            Dim index As Integer = GetOcrIndex(largeOcrText, searchText)
            Debug.WriteLine($"Index = {index}")
        End Sub

        ' Returns the index of the first match of needle in haystack, or -1 if there is none.
        Private Function GetOcrIndex(haystack As String, needle As String) As Integer
            Dim pattern As String = BuildPattern(needle)
            Debug.WriteLine($"Pattern = {pattern}")
            Dim match As Match = Regex.Match(haystack, pattern, RegexOptions.IgnoreCase)
            Return If(match.Success, match.Index, -1)
        End Function

        ' Turns "ABC" into "[A?][B?][C?]": one character class per needle character.
        Private Function BuildPattern(needle As String) As String
            Return String.Concat(needle.Select(AddressOf AddWildcard))
        End Function

        ' Wraps one character in a class that also accepts the OCR placeholder "?".
        Private Function AddWildcard(c As Char) As String
            Return $"[{Regex.Escape(c.ToString())}?]"
        End Function

    End Module

Output:

Pattern = [A?][B?][C?]
Index = 18
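
If you would rather avoid regular expressions, the same comparison can be sketched as a plain loop. This is only an illustration under assumptions, not part of the answers above: the names WildcardSearchSketch and IndexOfWithWildcards are hypothetical, every ? in the haystack is treated as matching exactly one character, and letters are compared case-insensitively to mirror RegexOptions.IgnoreCase.

```vb
Imports System

Module WildcardSearchSketch

    ' Returns the first index at which needle fits haystack, treating every
    ' "?" in haystack as "any single character"; returns -1 if there is none.
    Function IndexOfWithWildcards(haystack As String, needle As String) As Integer
        For start As Integer = 0 To haystack.Length - needle.Length
            Dim fits As Boolean = True
            For offset As Integer = 0 To needle.Length - 1
                Dim h As Char = haystack(start + offset)
                ' A placeholder always fits; otherwise compare ignoring case.
                If h <> "?"c AndAlso
                   Char.ToUpperInvariant(h) <> Char.ToUpperInvariant(needle(offset)) Then
                    fits = False
                    Exit For
                End If
            Next
            If fits Then Return start
        Next
        Return -1
    End Function

    Sub Main()
        Dim ocrText As String = "fAsD ?GjSDFpG HjDYA?C JLgD FHaYsV MKiI?oL XgXj?GN sHVKgG?"
        Console.WriteLine(IndexOfWithWildcards(ocrText, "ABC")) ' prints 18
    End Sub

End Module
```

The loop tries every start position, so in the worst case it performs roughly haystack length times needle length comparisons, which is usually acceptable for OCR-sized text.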