Regular expression to match and extract two version number patterns

Thanks for taking the time to read my question.

I am trying to write a regular expression that will match and extract the version number from a configuration file. The version number follows one of these two numbering patterns:

1) <version>2.343</version>
2) <version>2.343.2</version>

The result returned should be either:

1) 2.343
2) 2.343.2

My current solution is one of the following two awk commands, each with a regex that matches one of the cases individually. But there must be a solution that covers both cases?

awk 'match($0, /[0-9][.][0-9][0-9][0-9]/) {print substr($0, RSTART, RLENGTH) }' config.xml
awk 'match($0, /[0-9][.][0-9][0-9][0-9].[0-9]/) {print substr($0, RSTART, RLENGTH) }' config.xml

CodePudding user response：

1st solution: With your shown samples, please try the following. It uses awk's match() function, so it should work in any POSIX awk. The regex >[0-9]+(\.[0-9]+)*< matches a > followed by the version number followed by a <; when a match is found, the matched substring is printed without the surrounding > and <.

awk 'match($0,/>[0-9]+(\.[0-9]+)*</){print substr($0,RSTART+1,RLENGTH-2)}' Input_file

Or, in case you want to look specifically for the version tag, try the following:

awk 'match($0,/<version>[0-9]+(\.[0-9]+)*<\/version>/){print substr($0,RSTART+9,RLENGTH-19)}' Input_file
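
The offsets in the tag-specific command correspond to the tag lengths: <version> is 9 characters and </version> is 10, so 19 characters are dropped from the match in total. A minimal sketch, assuming a hypothetical sample file and made-up names (cfg, extract_versions, otag, ctag), that derives the offsets with length() instead of hard-coding them:

```shell
# Hypothetical sample file, created only for this illustration.
cfg=$(mktemp)
printf '%s\n' '<config>' '  <version>2.343</version>' '  <version>2.343.2</version>' '</config>' > "$cfg"

extract_versions() {
  # otag/ctag are illustrative variable names holding the two tags.
  awk -v otag='<version>' -v ctag='</version>' '
    match($0, /<version>[0-9]+(\.[0-9]+)*<\/version>/) {
      # Start right after the opening tag; drop both tags from the length.
      print substr($0, RSTART + length(otag), RLENGTH - length(otag) - length(ctag))
    }' "$1"
}

extract_versions "$cfg"
```

Lines without a version tag (here <config> and </config>) produce no output because match() returns 0 for them.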

2nd solution: With your shown samples, using GNU awk's RS variable with the same regex idea; the text that matched RS is then available in RT.

awk -v RS='<version>[0-9]+(\\.[0-9]+)*<\\/version>' 'RT{split(RT,arr,"[><]");print arr[3]}' Input_file

CodePudding user response：

You may use:

awk 'match($0, /[0-9]+(\.[0-9]+)+/) {
   print $1, substr($0, RSTART, RLENGTH)}' file

1) 2.343
2) 2.343.2

CodePudding user response：

Using GNU awk and the third argument of match():

$ gawk 'match($0,/<version>(.*)<\/version>/,a){print a[1]}' file
2.343
2.343.2

CodePudding user response：

Your two commands can be merged into one using ?, which means zero or one repetition, as follows

awk 'match($0, /[0-9][.][0-9][0-9][0-9]([.][0-9])?/) {print substr($0, RSTART, RLENGTH) }' config.xml

which for config.xml content as follows

1) <version>2.343</version>
2) <version>2.343.2</version>

gives output

2.343
2.343.2

(tested in gawk 4.2.1)
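
One detail worth checking in the optional group is that its dot is literal ([.] or \.), since a bare . matches any character. A small sketch of the difference, where loose and strict are hypothetical helper names and the input line 'build 2.343-7' is invented for illustration:

```shell
# loose: bare dot inside the optional group (matches any character)
loose() {
  awk 'match($0, /[0-9][.][0-9][0-9][0-9](.[0-9])?/) { print substr($0, RSTART, RLENGTH) }'
}

# strict: bracketed dot inside the optional group (matches only a literal dot)
strict() {
  awk 'match($0, /[0-9][.][0-9][0-9][0-9]([.][0-9])?/) { print substr($0, RSTART, RLENGTH) }'
}

printf '%s\n' 'build 2.343-7' | loose    # the "-7" is swallowed by the bare dot
printf '%s\n' 'build 2.343-7' | strict   # stops after 2.343
```

On the two sample lines from the question both variants behave the same; they only diverge when a non-dot character followed by a digit comes right after the three-digit part.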

CodePudding user response：

absolutely no need to invoke match() or resort to vendor-proprietary solutions

nawk ++NF OFS='' FS='(^[^>]*)?[<][/]?version[>]($)?'

2.343
2.343.2

the brute-force approaches :

gawk NF=NF OFS= FS='^[^>]+>|<[/].+$'  # kinda brute-force

mawk NF++ OFS= FS='^[^>]+.|./.+$'     # REALLY brute-force

2.343
2.343.2

CodePudding user response：

Here is another awk solution (tested with GNU and BSD awk) that tries to match exactly the two numbering patterns shown in the OP (<version>N.NNN</version> and <version>N.NNN.N</version> where N is any digit). It assumes that <version>...</version> tags are properly balanced, do not appear in comments, strings... and do not span over multiple lines. If several version numbers appear on the same line they are all printed.

awk -F '</?version>' '{
  for(i=1; i<=NF/2; i++)
    if($(2*i) ~ /^[0-9]\.[0-9]{3}(\.[0-9])?$/) print $(2*i)
}' config.xml

If the components of version numbers can have any number of digits (minimum 1), just relax the regular expression: /^[0-9]+(\.[0-9]+){1,2}$/. And if there can be any number of components (minimum 1), relax it a bit more: /^[0-9]+(\.[0-9]+)*$/ (or /^[0-9]+(\.[0-9]+)+$/ for at least 2 components).
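
As an illustration of the most permissive variant (one or more dot-separated numeric components), here is a sketch of the same field-splitting loop. The function name list_versions and the input lines are made up for the example, and the pattern avoids interval expressions such as {1,2} because some older awk builds do not support them:

```shell
list_versions() {
  awk -F '</?version>' '{
    # With the tags as field separators, tag contents land in the even-numbered fields.
    for (i = 2; i <= NF; i += 2)
      if ($i ~ /^[0-9]+(\.[0-9]+)*$/) print $i
  }'
}

# Invented input: two tags on one line, a four-component version, and a non-numeric value.
printf '%s\n' \
  '<version>10.4</version> <version>7</version>' \
  '<version>1.2.3.4</version>' \
  '<version>beta</version>' | list_versions
```

The non-numeric value beta is skipped, while 10.4, 7 and 1.2.3.4 are each printed on their own line.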

If <version>...</version> tags are not properly balanced, can appear in comments, or can span over several lines, a real XML parser would be a much better solution than a general purpose utility like awk.