Regex for capturing a number with a range of digits in AWK

I'm trying to capture numbers from a file using AWK. I can capture all of them, but I can't restrict the match to numbers with a certain number of digits. What am I doing wrong?

echo -e "$teste" | awk '/_OA/ { match($0,/\[\([:digit:]{4,13}\]/);oa = substr($0,RSTART,RLENGTH);print oa}'

File sample:

_OA ............. [6712227000168]
_OA Tasdsd, OA .. [91][355016]
_OA Tasdsd, DA .. [91][5512987000]

Expected:

6712227000168
355016
5512987000

CodePudding user response：

With your shown samples, please try the following awk solution. It sets the field separator to either ] or [ and, in the main block, prints the second-to-last field of every line that starts with _OA.

awk -F"[][]" '/^_OA /{print $(NF-1)}' Input_file
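To see why the second-to-last field is the one that matters: splitting `_OA Tasdsd, OA .. [91][355016]` on `[` or `]` gives the fields `_OA Tasdsd, OA .. `, `91`, an empty field, `355016`, and a trailing empty field, so NF is 5 and `$(NF-1)` is `355016`. This approach by itself does not check the digit count. The sketch below adds that check with `length()`, assuming the lines start with `_OA` as in the question's sample and that the wanted number is always the last bracketed value; the variable names `teste`, `numbers`, and `value` are only illustrative.

```shell
# Sample input from the question.
teste='_OA ............. [6712227000168]
_OA Tasdsd, OA .. [91][355016]
_OA Tasdsd, DA .. [91][5512987000]'

# Split on [ or ], take the second-to-last field, and keep it only
# if it is all digits and 4 to 13 characters long.
numbers=$(printf '%s\n' "$teste" | awk -F'[][]' '/^_OA / {
  value = $(NF-1)
  if (value ~ /^[0-9]+$/ && length(value) >= 4 && length(value) <= 13)
    print value
}')

printf '%s\n' "$numbers"
```

Checking the length with `length()` instead of `{4,13}` is a deliberate choice here, since some older awk builds do not enable interval expressions by default.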

CodePudding user response：

You could fix the pattern and adjust the RSTART and RLENGTH values passed to substr so that the leading and trailing square brackets are left out of the result.

The digits part should be [[:digit:]] and there is a \( in the pattern that matches ( that should not be there.

awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}\]/);oa = substr($0,RSTART+1,RLENGTH-2);print oa}' <<< "$teste"

Output

6712227000168
355016
5512987000

As there are multiple occurrences of digits between square brackets, if you want to match multiple occurrences:

teste='_OA Tasdsd, OA .. [91][355016][123456789][1][9999]'

awk '/_OA/ {
  while(match($0,/\[[[:digit:]]{4,13}]/)){
    start=RSTART+1; len=RLENGTH-2
    s=substr($0,start,len)
    res=res?res","s:s
    $0=substr($0,start+len)
  }
  print res
  res = ""
}' <<< "$teste"

Output

355016,123456789,9999
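The loop above works by overwriting $0, which also resets the fields of the current record. If the rest of the script still needs the original line, one option is to consume a copy instead. The following is a minimal sketch of that idea, not a replacement for the answer's code: `rest` and `joined` are hypothetical names, and the pattern is written as "four or more digits" plus a `length()` check for the upper bound, on the assumption that the awk in use may not support `{4,13}`.

```shell
teste='_OA Tasdsd, OA .. [91][355016][123456789][1][9999]'

joined=$(printf '%s\n' "$teste" | awk '/_OA/ {
  rest = $0          # work on a copy so $0 and its fields stay intact
  res = ""
  # \[ digit digit digit digit+ \] : a bracketed run of at least 4 digits
  while (match(rest, /\[[0-9][0-9][0-9][0-9]+\]/)) {
    s = substr(rest, RSTART + 1, RLENGTH - 2)     # drop the [ and ]
    if (length(s) <= 13)                          # enforce the upper bound
      res = (res == "" ? s : res "," s)
    rest = substr(rest, RSTART + RLENGTH)         # continue after this match
  }
  print res
}')

printf '%s\n' "$joined"
```

With the sample string this yields the same comma-separated list as the loop above, while `$0` remains available afterwards.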

CodePudding user response：

You can use

awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);print substr($0,RSTART+1,RLENGTH-2)}'

See the online demo:

#!/bin/bash
s='_OA ............. [6712227000168]
_OA Tasdsd, OA .. [91][355016]
_OA Tasdsd, DA .. [91][5512987000]'
awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);print substr($0,RSTART+1,RLENGTH-2)}' <<< "$s"

Output:

6712227000168
355016
5512987000

Details:

\[ - a [ char
[[:digit:]]{4,13} - four to thirteen digits (note that the [:digit:] POSIX character class must be used within [...], a bracket expression)
] - a ] char (it is not special here, so there is no need to escape it)

And substr($0,RSTART+1,RLENGTH-2) means:

$0 - take the current record (the whole input line)
RSTART+1 - start at the second character of the match
RLENGTH-2 - and take as many characters as the match length minus 2 (thus dropping the enclosing [ and ] chars)
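As a worked example of that arithmetic, take the second sample line. The `[` of `[355016]` is the 23rd character, so `match` sets RSTART to 23 and RLENGTH to 8 (six digits plus two brackets), and `substr($0, 24, 6)` returns the digits alone. The sketch below prints those values; `line` and `report` are illustrative names, and the pattern spells out "four or more digits" without `{4,13}` on the assumption that the awk in use may lack interval support.

```shell
line='_OA Tasdsd, OA .. [91][355016]'

# Print the values match() sets, plus the substring with the brackets removed.
report=$(printf '%s\n' "$line" | awk '
  match($0, /\[[0-9][0-9][0-9][0-9]+\]/) {
    printf "RSTART=%d RLENGTH=%d value=%s\n",
           RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 2)
  }')

printf '%s\n' "$report"
# RSTART=23 RLENGTH=8 value=355016
```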

CodePudding user response：

Your regexp \[\([:digit:]{4,13}\] says:

1. \[ = the literal character [
2. \( = the literal character (
3. [:digit:] = a bracket expression containing a character set of the characters :, d, i, g, t
4. {4,13} = a regexp interval that's 4 to 13 repetitions of the preceding bracket expression
5. \] = the literal character ]

The 2 main issues that prevent your regexp from matching any of your input are:

1. You don't have any (s in your input (from #2 above), and
2. To match digits you need a character class [:digit:] inside a bracket expression [[:digit:]], not a character set :digit: inside a bracket expression [:digit:] (from #3 above)

You also don't actually need to escape the ] at the end of the regexp as it's only a regexp metachar (end of bracket expression) if preceded by a matching unescaped [ (start of bracket expression).

So the regexp I think you wanted to write instead would have been:

\[[[:digit:]]{4,13}]

e.g.:

$ awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);oa = substr($0,RSTART,RLENGTH);print oa}' file
[6712227000168]
[355016]
[5512987000]

or to only print the numbers:

$ awk '/_OA/ { match($0,/\[[[:digit:]]{4,13}]/);oa = substr($0,RSTART+1,RLENGTH-2);print oa}' file
6712227000168
355016
5512987000