Input file
<a href="perl.html">perl</a> <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser.html">zoiduser</a> <a href="perl.html">perl</a> <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>
I need to remove the .html extension only from the relative URLs below, which appear in the file above:
<a href="perl.html">perl</a>
<a href="zoiduser.html">zoiduser</a>
So that the final output should look like:
<a href="perl">perl</a> <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser">zoiduser</a> <a href="perl.html">perl</a> <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>
This is what I am doing:
sed '/"http\|"www\|"mailto/ ! s|\(.html\)||g' file
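But it skips the whole line as soon as the line contains any of the patterns "http, "www, or "mailto, so the relative URLs on that line keep their .html extension. I only want to skip the URLs that start with "http, "www, or "mailto.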
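The behaviour described above can be reproduced with a minimal sketch. The two sample lines and the function name attempt are hypothetical, and the command is an ERE restatement of the attempt (with the dot escaped), not the exact command from the question:

```shell
#!/bin/sh
# Hypothetical one-line samples (not from the original input file).
only_relative='<a href="perl.html">perl</a>'
mixed='<a href="perl.html">perl</a> <a href="http://example.com/out.html">x</a>'

# ERE restatement of the attempt: the address /.../! selects whole lines,
# so s/// runs only on lines that contain none of the three prefixes.
attempt() {
  sed -E '/"(http|www|mailto)/!s/\.html//g'
}

result_relative=$(printf '%s\n' "$only_relative" | attempt)
result_mixed=$(printf '%s\n' "$mixed" | attempt)

printf '%s\n' "$result_relative"   # the relative link loses .html
printf '%s\n' "$result_mixed"      # printed unchanged: the line also contains "http
```

Because the address applies to the line and not to each URL, a single absolute link is enough to protect every relative link on the same line.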
CodePudding user response:
You can use
sed -E 's/("(http|www|mailto)[^"]*")|\.html/\1/g' file
Details:
-E - enables POSIX ERE syntax
("(http|www|mailto)[^"]*") - Group 1 (\1): a ", then either http, www, or mailto, then zero or more chars other than ", and then a "
| - or
\.html - a literal .html string.
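The replacement is the Group 1 value: a protected URL is put back unchanged, and when only .html matched, Group 1 is empty, so the extension is removed.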
See the online demo:
#!/bin/bash
s='<a href="perl.html">perl</a> <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser.html">zoiduser</a> <a href="perl.html">perl</a> <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>'
sed -E 's/("(http|www|mailto)[^"]*")|\.html/\1/g' <<< "$s"
Output:
<a href="perl">perl</a> <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser">zoiduser</a> <a href="perl">perl</a> <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>
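The demo input only contains http links, so here is a small sketch that also exercises the www and mailto branches. The sample line and the function name strip_relative_html are hypothetical and only serve as an illustration:

```shell
#!/bin/sh
# Hypothetical sample line (not from the original input file).
line='<a href="mailto:me@host.html">mail</a> <a href="www.example.com/x.html">site</a> <a href="docs/index.html">docs</a>'

# Same expression as above: quoted strings starting with http, www, or mailto
# are captured in Group 1 and written back; any other .html is dropped.
strip_relative_html() {
  sed -E 's/("(http|www|mailto)[^"]*")|\.html/\1/g'
}

result=$(printf '%s\n' "$line" | strip_relative_html)
printf '%s\n' "$result"
```

Only the docs/index.html link should lose its extension here, since the other two quoted values are consumed whole by Group 1 before the \.html alternative gets a chance to match inside them.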
CodePudding user response:
It is not recommended to parse HTML with shell utilities such as sed, awk, or perl. But if you really have to exclude certain keywords, then I would suggest this perl command:
perl -pe 's/"(?!www|http|mailto)([^"]+)\.html/"$1/g' f.html
<a href="perl">perl</a> <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser">zoiduser</a> <a href="perl">perl</a> <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>
(?!www|http|mailto) is a negative lookahead that fails the match if any of these keywords appears right after the opening ".