capture repeating regex pattern as one group, sed in bash script

Time:08-09

I wrote a working expression that extracts two pieces of data from valid lines of text. The first capture group is the numerical section including periods. The second is the remaining characters of the line as long as the line is valid. A line is invalid if the numerical section ends with a period or the line ends with a number.

1.1 the quick 1-1 (no match due to ending hyphen and number)
11.2 brown fox jumped (should return '11.2' and 'brown fox jumped')
1.41.1 over the lazy (should return '1.41.1' and 'over the lazy')
2.1. dog (no match due to numerical section trailing period)

The expression ^((?:[0-9]+\.)+[0-9]+) (.*)[^0-9]$ works when tested on various regex testing sites.

My issue is... that I have failed to adapt this expression to work with sed from a bash script that loops through lines of text ($L).

IFS=$'\t' read -r NUM STR < <(sed 's#^\(\(?:[0-9]\+\.\)\+[0-9]\+\) \(.*)[^0-9]$#\1\t\2#p;d' <<< $L )

What does work is below where I replaced the capturing of repeating groups with repeating digits and periods. I would prefer not to do this because it could match lines starting with periods and multiple periods in a row. Also it loses the last char of the captured string but I expect I can figure that part out.

IFS=$'\t' read -r NUM STR < <(sed 's#^\([0-9\.]\+[0-9]\+\) \(.*[^0-9]\)$#\1\t\2#p;d' <<< $L )

Please help me understand what I'm doing wrong. Thank you.

CodePudding user response:

An ERE for that would be:

^([0-9]+(\.[0-9]+)*) (.*[^0-9])$

with \1 and \3 being the capture groups of interest
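As a rough sketch of how that ERE could be plugged back into the original sed + read approach: POSIX regular expressions, which sed uses, have no (?:...) non-capturing groups, so the inner group has to be a plain capturing one and the text moves to \3. The helper name split_line is hypothetical, and sed -E is assumed to be available (it is in GNU and BSD sed). A literal tab is spliced into the replacement rather than relying on \t, which not every sed interprets.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints "NUM<TAB>STR" for a valid line, nothing otherwise.
split_line() {
    local tab=$'\t'
    # -n suppresses default output; the trailing p prints only substituted lines.
    # \1 = whole numeric section, \2 = its last ".digits" repetition, \3 = the text.
    sed -nE 's#^([0-9]+(\.[0-9]+)*) (.*[^0-9])$#\1'"$tab"'\3#p' <<< "$1"
}

# Sample lines taken from the question.
while IFS= read -r L; do
    # read fails on empty input, so invalid lines are skipped.
    IFS=$'\t' read -r NUM STR < <(split_line "$L") || continue
    printf 'NUM=%s STR=%s\n' "$NUM" "$STR"
done <<'EOF'
1.1 the quick 1-1
11.2 brown fox jumped
1.41.1 over the lazy
2.1. dog
EOF
```

Only the second and third sample lines should produce output here; the other two fail the pattern and are skipped.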

But I'm not sure that using sed + read is the best approach for capturing the data in variables; you could just use bash builtins instead:

#!/bin/bash

while IFS=' ' read -r num str
do
    [[ $num =~ ^([0-9]+(\.[0-9]+)*)$ && $str =~ [^0-9]$ ]] || continue
    declare -p num str
done < input.txt

There's a side-effect with this solution though: The read will strip the leading, trailing and the first middle space chars of the line.
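To make that side effect concrete, here is a small illustration; the sample line is made up and padded with extra spaces. With IFS=' ', read drops the leading and trailing spaces and the run of spaces after the first field, while spaces inside the remaining text survive. With an empty IFS the line is kept verbatim.

```shell
#!/usr/bin/env bash

# Made-up sample with leading, repeated middle, and trailing spaces.
line='  11.2   brown  fox  '

# Splitting on spaces: outer spaces and the first separator run are lost.
IFS=' ' read -r num str <<< "$line"
printf '[%s] [%s]\n' "$num" "$str"   # prints: [11.2] [brown  fox]

# Empty IFS: no splitting or trimming, the line is preserved as-is.
IFS='' read -r whole <<< "$line"
printf '[%s]\n' "$whole"             # prints: [  11.2   brown  fox  ]
```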

If you need those spaces then you can match the whole line instead:

#!/bin/bash

regex='^([0-9]+(\.[0-9]+)*) (.*[^0-9])$'

while IFS='' read -r line
do
    [[ $line =~ $regex ]] || continue
    num=${BASH_REMATCH[1]}
    str=${BASH_REMATCH[3]}
    declare -p num str
done < input.txt
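The reason the text ends up in index 3 rather than 2 can be seen by dumping every element of BASH_REMATCH for one of the question's sample lines. This is only an illustrative sketch: group 2 is the inner repeated group, and a repeated group keeps just its last repetition, which is why the whole numeric section needs the outer group 1.

```shell
#!/usr/bin/env bash

regex='^([0-9]+(\.[0-9]+)*) (.*[^0-9])$'
line='1.41.1 over the lazy'   # sample line from the question

if [[ $line =~ $regex ]]; then
    # Index 0 is the whole match; 1..3 are the parenthesised groups in order
    # of their opening parenthesis.
    for i in "${!BASH_REMATCH[@]}"; do
        printf '%d: %s\n' "$i" "${BASH_REMATCH[i]}"
    done
fi
# prints:
# 0: 1.41.1 over the lazy
# 1: 1.41.1
# 2: .1
# 3: over the lazy
```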