sed to ignore a pattern as well as match a pattern in same line

Time:05-23

Input file

<a href="perl.html">perl</a>     <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser.html">zoiduser</a>    <a href="perl.html">perl</a>     <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>

I need to remove the .html extension only from the following URLs in the file above:

<a href="perl.html">perl</a>
<a href="zoiduser.html">zoiduser</a>

So that the final output should look like:

<a href="perl">perl</a>     <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser">zoiduser</a>    <a href="perl.html">perl</a>     <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>

This is what I am doing:

sed '/"http\|"www\|"mailto/ ! s|\(.html\)||g' file

But this skips the entire line as soon as it contains one of the excluded patterns, whereas I only want to skip the individual URLs that start with "http, "www or "mailto.
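
A minimal sketch of why this happens, assuming GNU sed (for \| in a basic regex) and a hypothetical two-line sample rather than the real file: the /regex/ ! prefix is an address, so it selects or skips whole lines, never individual matches within a line.

```shell
#!/bin/bash
# Hypothetical sample (not the real input file): line 1 mixes a local link
# with an absolute URL, line 2 holds a local link only.
sample='<a href="a.html">a</a> <a href="http://example.com/x.html">x</a>
<a href="b.html">b</a>'

# The address /.../ ! runs the substitution only on lines that do NOT match,
# so line 1 is skipped entirely and a.html keeps its extension.
result=$(sed '/"http\|"www\|"mailto/ ! s|\.html||g' <<< "$sample")
printf '%s\n' "$result"
```

Only the second line loses its extension here. In the input file above, both lines contain "http, so the command changes nothing at all.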

CodePudding user response:

You can use

sed -E 's/("(http|www|mailto)[^"]*")|\.html/\1/g' file

Details:

  • -E - enables POSIX ERE syntax
  • ("(http|www|mailto)[^"]*") - Group 1 (\1): " and then either http, www, or mailto and then zero or more chars other than " and then a "
  • | - or
  • \.html - a literal .html string (the dot is escaped so that it matches only a period).

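The replacement is the Group 1 value: an excluded URL matched by the first alternative is put back unchanged, while a bare .html match leaves Group 1 empty and is therefore removed.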
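
To make the role of \1 visible, here is a small sketch on a hypothetical one-line string. The square brackets in the replacement are added purely for illustration and are not part of the command above.

```shell
#!/bin/bash
# Hypothetical sample: one excluded (absolute) URL and one local link.
s='"http://example.com/a.html" "b.html"'

# Same regex as above, but the replacement wraps \1 in [ ] so that its
# value can be seen for each match.
out=$(sed -E 's/("(http|www|mailto)[^"]*")|\.html/[\1]/g' <<< "$s")
printf '%s\n' "$out"
# → ["http://example.com/a.html"] "b[]"
```

The quoted absolute URL is consumed whole by the first alternative and written back as is, so the scan never reaches the .html inside it. For the bare .html in the local link, Group 1 is empty and nothing is written back.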

See the online demo:

#!/bin/bash
s='<a href="perl.html">perl</a>     <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser.html">zoiduser</a>    <a href="perl.html">perl</a>     <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>'
sed -E 's/("(http|www|mailto)[^"]*")|\.html/\1/g' <<< "$s"

Output:

<a href="perl">perl</a>     <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser">zoiduser</a>    <a href="perl">perl</a>     <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>

CodePudding user response:

Parsing HTML with text utilities such as sed, awk, or perl is not recommended. But if you really need to exclude certain keywords, I would suggest this perl command:

perl -pe 's/"(?!www|http|mailto)([^"]+)\.html/"$1/g' f.html

<a href="perl">perl</a>     <a href="http://zoidberg.sourceforge.net/out.html">http://zoidberg.sourceforge.net</a>
<a href="zoiduser">zoiduser</a>    <a href="perl">perl</a>     <a href="http://zoidberg.sourceforge.net/sample.html">http://zoidberg.sourceforge.net</a>

(?!www|http|mailto) is a negative lookahead that fails the match if any of these keywords appears immediately after the opening ".
