The sed command is not working with regex

I'm parsing the output of an HTTP GET request with sed to retrieve the contents of a given HTML tag. The result of that request looks like this:

"<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>"

And I want to retrieve the version number inside the p element.

However, sed seems to have a bug in regex parsing. When I use:

sed 's/.*<p>//'

It correctly removes the text to the left of the version (i.e., it outputs "v1.0.4-b</p></body></html>"). But when I try to use a capturing group, with

sed 's/.*<p>(.*)<\/p>.*/\1/'

It fails to match and gives an error:

sed: -e expression #1, char 20: invalid reference \1 on `s' command's RHS.

Despite that, when I test the regex on online regex validators it works.

Thank you in advance

CodePudding user response:

You need to use either of the following (the first uses POSIX BRE syntax, the second POSIX ERE):

sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p'
sed -n -E 's~.*<p>([^<]*)</p>.*~\1~p'

See the online demo:

#!/bin/bash
sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p' <<< \
 "<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>"
## => v1.0.4-b

The sed 's/.*<p>(.*)<\p>.*/\1/' command does not work because:

You are using a POSIX BRE pattern, where unescaped ( and ) are treated as literal parenthesis characters, not as a capturing group. In POSIX BRE, you need \(...\) to define a capturing group. This is why you get the "invalid reference \1" error: the pattern defines no group for \1 to refer to.
If you add the -E option to enable POSIX ERE, you can use (...) to define a capturing group.
The pattern contains \p rather than \/p, so it does not match the closing </p> tag.
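
The difference between the two syntaxes can be checked with a short sketch like the one below. It reuses the HTML string from the question; the variable names (html, bre, ere, bad) are only illustrative, and the script assumes a sed that supports -E (GNU and BSD sed both do).

```shell
#!/bin/bash
html='<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>'

# POSIX BRE (sed default): a capturing group is written \( ... \)
bre=$(sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p' <<< "$html")

# POSIX ERE (-E): a capturing group is written ( ... )
ere=$(sed -n -E 's~.*<p>([^<]*)</p>.*~\1~p' <<< "$html")

# BRE with unescaped parentheses: they are literal characters, so the
# pattern defines no group and sed rejects the \1 reference with an error
bad="accepted"
if ! sed 's~.*<p>(.*)</p>.*~\1~' <<< "$html" >/dev/null 2>&1; then
    bad="rejected"
fi

echo "BRE: $bre"
echo "ERE: $ere"
echo "BRE with unescaped parentheses: $bad"
```

Both working forms should print the same version string; which one to use is mostly a matter of which escaping style you find easier to read.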

As there are slashes in the pattern, it is more convenient to choose a regex delimiter other than /; I chose ~ here.
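
To illustrate the delimiter choice, here is a small sketch that runs the same substitution with three delimiters. The variable names are hypothetical; the # variant is just one more possible delimiter, not something the answer above prescribes.

```shell
#!/bin/bash
html='<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>'

# With / as the delimiter, the slash in </p> has to be escaped as \/
with_slash=$(sed -n 's/.*<p>\([^<]*\)<\/p>.*/\1/p' <<< "$html")

# With ~ as the delimiter, </p> can be written as-is
with_tilde=$(sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p' <<< "$html")

# Any other character that does not occur in the pattern works too, e.g. #
with_hash=$(sed -n 's#.*<p>\([^<]*\)</p>.*#\1#p' <<< "$html")

echo "$with_slash"
echo "$with_tilde"
echo "$with_hash"
```

All three commands perform the same substitution; the alternative delimiters only save you from escaping the slash.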

Also, I used the -n option to suppress the default line output and the p flag to print only the result of a successful substitution.
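
The effect of -n together with the p flag is easiest to see on input with more than one line. The sketch below assumes a hypothetical two-line input (a status line followed by the HTML); it is not the actual response from the question.

```shell
#!/bin/bash
# Hypothetical two-line input: only the second line contains a <p> element
input=$'HTTP/1.1 200 OK\n<html><body><p>v1.0.4-b</p></body></html>'

# Without -n, sed prints every line, so the non-matching line passes through unchanged
without_n=$(sed 's~.*<p>\([^<]*\)</p>.*~\1~' <<< "$input")

# With -n and the p flag, only lines where the substitution succeeded are printed
with_n=$(sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p' <<< "$input")

printf 'without -n:\n%s\n\n' "$without_n"
printf 'with -n and p:\n%s\n' "$with_n"
```

With single-line input like the one in the question, both variants print the same thing; the -n/p combination matters once the input contains lines that do not match.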