How to prefer use of an optional part of a regular expression?

Time:09-26

Is there a way of telling a regular expression (specifically sed) to prefer using an optional component when the input also matches without using that component?

I'm trying to extract a number from a string that may optionally be preceded by prefix. It works in the following cases:

echo dummy/123456/dummy | sed "s:.*/\(prefix\)\?\([0-9]\{3,\}\)/.*:\2:"
123456

echo dummy/prefix123456/dummy | sed "s:.*/\(prefix\)\?\([0-9]\{3,\}\)/.*:\2:"
123456

but if the string contains both a prefixed number and a "bare" number, it chooses the bare number:

echo dummy/prefix123456/987654/dummy | sed "s:.*/\(prefix\)\?\([0-9]\{3,\}\)/.*:\2:"
987654

Is there a way of forcing sed to prefer the match including the prefix (123456)? All search results I've found talk of greedy/lazy options, which – as far as I can tell – don't apply here.

Clarifications

  • The dummy portions in the examples above may contain slashes.

  • The bit I'm interested in is either the first slash-delimited run of three or more digits (.../123456/...) or the first slash-delimited run of three or more digits with a prefix (.../prefix123456/...), whichever occurs first.

CodePudding user response:

You may try this sed command:

sed '
/.*\/prefix\([0-9]\{3,\}\)\/.*/{
    s//\1/
    b
}
s/.*\/\([0-9]\{3,\}\)\/.*/\1/
' file

which will print out

123456
123456
123456
123456

where the content of file is

dummy/123456/dummy
dummy/prefix123456/dummy
dummy/prefix123456/987654/dummy
dummy/987654/prefix123456/dummy
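
For readers unfamiliar with the empty-regex and b idioms, here is the same script with comments, wrapped in a shell function. This is only an annotated sketch of the command above; the function name prefer_prefixed is a hypothetical label, not something from the original answer.

```shell
# prefer_prefixed: hypothetical wrapper name around the sed script above.
prefer_prefixed() {
    sed '
        # If the line contains /prefix<3+ digits>/ anywhere ...
        /.*\/prefix\([0-9]\{3,\}\)\/.*/{
            # ... reuse that same regex (empty pattern) and keep the digits,
            s//\1/
            # then branch to the end of the script, skipping the fallback.
            b
        }
        # Fallback for lines without a prefixed number: keep a bare /digits/ run.
        s/.*\/\([0-9]\{3,\}\)\/.*/\1/
    ' "$@"
}
```

Called as prefer_prefixed file, it behaves like the original command. Because each leading .* is greedy, a line with several candidates yields the last prefixed number, or the last bare number when no prefixed one exists.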

CodePudding user response:

sed's BRE and ERE syntaxes have no lazy quantifiers, so a non-greedy leading .*? is not available.

However, based on your use-cases, you may use this sed:

sed -E 's~[^/]*/(prefix){0,1}([0-9]{3,})/.*~\2~' file

123456
123456
123456

where input is:

cat file

dummy/123456/dummy
dummy/prefix123456/dummy
dummy/prefix123456/987654/dummy

Here we are using a negated character class (bracket expression) [^/]* instead of .* so that the leading part of the pattern matches 0 or more characters that are not a /.
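
If the number can sit deeper in the path, one way to get the "first match wins" effect without a lazy quantifier is to rely on s/// replacing only the leftmost match when the g flag is absent. The following is a minimal sketch, assuming GNU sed (it uses -E and \n both in the replacement and inside a bracket expression); the function name first_number is hypothetical.

```shell
# first_number: hypothetical helper name; assumes GNU sed.
first_number() {
    sed -n -E '
        # Wrap the leftmost /[prefix]digits/ match in newline markers,
        # keeping only the digits (no g flag, so only the leftmost match).
        s:/(prefix)?([0-9]{3,})/:\n\2\n:
        # Delete everything up to and including the first marker.
        s/^[^\n]*\n//
        # Delete the second marker and the rest, then print the digits.
        s/\n.*//p
    ' "$@"
}
```

For example, echo dummy/prefix123456/987654/dummy | first_number should print the prefixed number, and lines with no slash-delimited run of three or more digits print nothing because of -n.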


If Perl is an option, a lazy .*? combined with a negative lookahead will work for you:

perl -pe 's~^.*?/(?:prefix)?(\d{3,})(?!.*prefix\d{3}).*~$1~' file


CodePudding user response:

With GNU awk, you could try the following code. It was written and tested with the shown samples only.

awk 'match($0,/\/(prefix){0,1}([0-9]+)/,arr){print arr[2]}' Input_file

Explanation: this uses GNU awk's match function with an array argument. The regex (prefix){0,1}([0-9]+) has two capturing groups, whose matched values are stored in the array named arr. If the match succeeds, the second element of that array (the digits) is printed.

CodePudding user response:

Using sed

$ sed -E 's/[^0-9]*(prefix)?([0-9]{3,}).*/\2/' input_file
123456
