Home > Software design >  Pattern to include some numbers but not mostly text
Pattern to include some numbers but not mostly text

Time:04-06

I'm currently trying to create a pattern to capture a name that generally is text [A-ZÅÄÖ] but there could be one or two numeric characters.

Here is one example: (there are more fields than this, this is just an example)

PT22  111222333  2 BROTHERS  RP02  570244222333
PT25  888222333  THIS TIME IT'S ONLY TEXT  RP01  888244222333

I'm trying to get 2 BROTHERS and THIS TIME IT'S ONLY TEXT in this tab delimited string (rows are for eached).

When I used the pattern "(\b[A-ZÅÄÖ \&\.\´\:-_] \b)" it captured BROTHERS, if I add 0-9 in the pattern it takes either PT22 or 111222333 if the first field is empty.

So what I would need is something that captures the field that has at least three characters that is A-ZÅÄÖ and maximum 2 characters 0-9.
Is that even possible or do I need to loop through the fields and count each type of character to determine if I'm in the correct field or not?
The fields are not in the same order every time.

My backup plan is to split on tab, and loop. On each field remove all [0-9] characters and count.
That way I will get how many [0-9] and how many [A-ZÅÄÖ] ( other) there is, and that could be used I think to find the correct field.

CodePudding user response:

I am not an regex expert, maybe it is possible, but I usually find it too hard to figure this out - and ending with an regex where you're not 100% sure it fits, plus it is hard to figure out what it does once you come back to it.

I would use a small function that simply count the number of chars and digits of a string. Easy to write, easy to read, easy to adapt. Could look like

Function checkWord(s As String) As Boolean
    Const minChars = 3
    Const maxDigits = 2
    
    Dim i As Long, digits As Long, chars As Long
    For i = 1 To Len(s)
        Select Case AscW(Mid(s, i, 1))
            Case AscW("A") To AscW("Z"), AscW("Ä"), AscW("Ö"), AscW("Å"): 
                chars = chars   1
            Case Asc("0") To Asc("9"): 
                digits = digits   1
            Case Else
                ' Think about what happens with all other characters...
        End Select
    Next
    checkWord = (chars >= minChars And digits <= maxDigits)
    
End Function

CodePudding user response:

Without RegEx:

Can't you turn the logic around and remove parts you don't want:

  • Loop over each value;
  • Your values are tab-delimited so split on VbTab;
  • Loop over each sub-element;
  • Test each sub-element with the Like() operator and check that it's not numeric.

To test this I came up with the following dummy-code:

Sub Test()

Dim r As Variant: r = Array("PT22" & vbTab & "111222333" & vbTab & "2 BROTHERS" & vbTab & "RP02" & vbTab & "570244222333", "PT25" & vbTab & "888222333" & vbTab & "THIS TIME IT'S ONLY TEXT" & vbTab & "RP01" & vbTab & "888244222333")

For Each el In r
    For Each SubEl In Split(el, Chr(9))
        If Not SubEl Like "??[0-9][0-9]" And Not IsNumeric(SubEl) Then
            Debug.Print SubEl
            Exit For
        End If
    Next
Next

End Sub

With RegEx:

If regex is a must, try to capture what you do want:

Sub Test()

Dim r As Variant: r = Array("PT22" & vbTab & "111222333" & vbTab & "2 BROTHERS" & vbTab & "RP02" & vbTab & "570244222333", "PT25" & vbTab & "888222333" & vbTab & "THIS TIME IT'S ONLY TEXT" & vbTab & "RP01" & vbTab & "888244222333")

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = "(?:^|\s)((?:[A-ZÅÄÖ] (?:[ &.'_-][A-ZÅÄÖ] )*\s?)?(?:\b\d\d?\b)?(?:\s?[A-ZÅÄÖ] (?:[ &.'_-][A-ZÅÄÖ] )*)?)(?:\s|$)|."
    For Each el In r
        Debug.Print Trim(.Replace(el, "$1"))
    Next
End With

End Sub

Or, again remove only those parts you don't want:

Sub Test()

Dim r As Variant: r = Array("PT22" & vbTab & "111222333" & vbTab & "2 BROTHERS" & vbTab & "RP02" & vbTab & "570244222333", "PT25" & vbTab & "888222333" & vbTab & "THIS TIME IT'S ONLY TEXT" & vbTab & "RP01" & vbTab & "888244222333")

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = "\s*\b(?:[A-Z]{2}\d\d|\d{3,})\b\s*"
    For Each el In r
        Debug.Print .Replace(el, "")
    Next
End With

End Sub

Results:

All above options would print:

2 BROTHERS
THIS TIME IT'S ONLY TEXT
  • Related