I'm parsing the output of an HTTP GET request with sed to retrieve the contents of a given HTML tag. The result of that request looks like this:
"<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>"
And I want to retrieve the version number inside the p element.
However, sed seems to have a bug in regex parsing.
When I use:
sed 's/.*<p>//'
It correctly removes the text to the left of the version (i.e., it outputs "v1.0.4-b</p></body></html>"). But when I try to use regex groups, with
sed 's/.*<p>(.*)<\/p>.*/\1/'
It fails to match and gives an error:
sed: -e expression #1, char 20: invalid reference \1 on `s' command's RHS.
Despite that, when I test the regex on online regex validators it works.
Thank you in advance
CodePudding user response:
You need to use either
sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p'
or
sed -n -E 's~.*<p>([^<]*)</p>.*~\1~p'
See the online demo:
#!/bin/bash
sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p' <<< \
"<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>"
## => v1.0.4-b
The sed 's/.*<p>(.*)<\p>.*/\1/' command would not work because
- You are using a POSIX BRE pattern where the unescaped ( and ) are treated as literal parenthesis chars, not a capturing group. In POSIX BRE, you need \(...\) to define a capturing group (this is why you get the invalid reference \1 error)
- If you add the -E option to enable POSIX ERE, you can use (...) to define a capturing group
- You are not matching /p, you have \p in the pattern.
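To make the BRE/ERE difference concrete, here is a minimal sketch that runs the asker's original .* group in both syntaxes against the sample input from the question. The variable names (html, bre, ere) are only illustrative, and it assumes a sed that accepts -E (recent GNU and BSD versions do).

```shell
#!/bin/bash
# Sample input copied from the question.
html='<!DOCTYPE html><html><body><h1>Hello!</h1><p>v1.0.4-b</p></body></html>'

# BRE (sed's default): a capturing group needs escaped parentheses \( \).
bre=$(printf '%s\n' "$html" | sed -n 's~.*<p>\(.*\)</p>.*~\1~p')

# ERE (enabled with -E): unescaped parentheses ( ) form the capturing group.
ere=$(printf '%s\n' "$html" | sed -n -E 's~.*<p>(.*)</p>.*~\1~p')

echo "BRE: $bre"
echo "ERE: $ere"
```

Both commands should print the same version string. The only difference between them is how the group is written.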
As there are slashes in the pattern, it is more convenient to choose a regex delimiter other than /; I chose ~ here.
Also, I used the -n option to suppress default line output and the p flag to print only the result of the substitution.
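The effect of -n together with the p flag is easier to see on input that has more than one line. The sketch below uses a made-up two-line input (not from the question) in which only the second line contains a p element. The variable names are illustrative.

```shell
#!/bin/bash
# Hypothetical two-line input: only the second line has a <p> element.
input=$(printf '%s\n' '<h1>Hello!</h1>' '<p>v1.0.4-b</p>')

# Without -n, sed prints every line, including the one that did not match.
without_n=$(printf '%s\n' "$input" | sed 's~.*<p>\([^<]*\)</p>.*~\1~')

# With -n and the p flag, only lines where the substitution happened are printed.
with_n=$(printf '%s\n' "$input" | sed -n 's~.*<p>\([^<]*\)</p>.*~\1~p')

printf 'without -n:\n%s\n' "$without_n"
printf 'with -n and p:\n%s\n' "$with_n"
```

Without -n the unmatched h1 line passes through unchanged next to the extracted version, which is usually not what you want when you are after a single value.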