How can I remove all <a href="file://???">keep this text</a> wrappers, but not the other <a></a> or </a> tags, using sed or perl?
Is:
<p><a class="a" href="file://any" id="b">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>
Should be:
<p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>
I have a regex like this, but it is too greedy and removes all </a>:
gsed -E -i 's/<a*href="file:[^>]*>(.+?)<\/a>/\1>/g' file.xhtml
CodePudding user response:
Assumptions:
- OP does not have access to an HTML-centric tool
- remove the <a href="file:...">...some_text...</a> wrappers, leaving just ...some_text...
- only apply to file: entries
- input data does not have a line break/feed in the middle of a file: entry
Sample data showing multiple file: entries interspersed with some other (nonsensical) entries:
$ cat sample.html
<p><a href="https:/google.com">some text</a><a href="file://any" >keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p><a href="file://anyother" >keep this text,too</a>, last test</p>
One sed idea to remove the wrappers for all file: entries:
sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g' "${infile}"
NOTE: perhaps a bit overkill with some of the [^..] entries, but the key objective is to short-circuit sed's default greedy matching ...
This leaves:
<p><a href="https:/google.com">some text</a>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>keep this text,too, last test</p>
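To illustrate why the negated classes matter, here is a minimal sketch comparing a greedy .* capture with the [^<]+ capture. The input line is hypothetical, made up for this comparison, and is not the sample file above.

```shell
# Hypothetical one-line input: two file: links around one http: link.
line='<p><a href="file://a">one</a>, <a href="http://example.com">ex</a>, <a href="file://b">two</a></p>'

# Greedy capture: (.*) runs on to the LAST </a> on the line, so only the
# outermost pair is stripped; a stray </a> and the second file: link remain.
greedy=$(printf '%s\n' "$line" | sed -E 's|<a[^<>]+file:[^>]+>(.*)</a>|\1|g')

# Negated classes: every piece stops at the next < or >, so each match
# stays inside a single <a ...>text</a> element.
bounded=$(printf '%s\n' "$line" | sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g')

printf '%s\n%s\n' "$greedy" "$bounded"
```

The first printed line still contains the second file: link and an unbalanced </a>; the second line has both file: wrappers removed and the http: link intact.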
CodePudding user response:
One way:
sed -E 's,<a[^>]*?href="file://[^>]*>([^<]*)</a>,\1,g'
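- <a[^>]*?href="file://[^>]*> : match <a, then any number of non-> characters, followed by href="file://, then any number of non-> characters followed by >. The ? is meant as non-greedy, but POSIX extended regular expressions have no lazy quantifiers; it is the [^>] class that confines the match to a single tag.
- ([^<]*) : match and capture any number of non-< characters
- </a> : match on </a>
Everything matched is substituted by the capture in \1, and the ending g makes it do the substitution on every occurrence on each line.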
Examples:
$ cat data
<p><a class="a" href="file://any" id="b">keep this text</a>, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p><a href="file://any" class="f">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>
$ sed -E 's,<a[^>]*?href="file://[^>]*>([^<]*)</a>,\1,g' < data
<p>keep this text, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>
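As a small sketch, assuming sed -E follows POSIX extended-regex rules (which define no lazy quantifiers), the same substitution can be written without the ? after [^>]*. The variable names data and result are only for illustration.

```shell
# The two sample lines from the example above.
data='<p><a class="a" href="file://any" id="b">keep this text</a>, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
<p><a href="file://any" class="f">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>'

# Same substitution with the ? dropped: [^>]* cannot cross a >, so the
# search for href="file:// never leaves the current opening tag.
result=$(printf '%s\n' "$data" | sed -E 's,<a[^>]*href="file://[^>]*>([^<]*)</a>,\1,g')

printf '%s\n' "$result"
```

The link with id="file:ex" is left alone because the pattern requires href="file:// inside the same opening tag.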
CodePudding user response:
Considering the case where the content of the <a> tag spans multiple lines, how about a perl solution:
perl -0777 -i -pe 's#<a.+?href="?file.+?>(.+?)</a>#$1#gs' file.xhtml
- The -0777 option tells perl to slurp the whole file.
- The -i option enables in-place editing.
- The s switch at the end of the s operator makes a dot match any character, including a newline.
- The regex .+? is the non-greedy version of .+ to enable the shortest match.