// Strips HTML tags from a string
public static String stripHtml(String inStr) {
boolean inTag = false;
char c;
StringBuffer outStr = new StringBuffer();
int len = inStr.length();
for (int i = 0; i < len; i++) {
c = inStr.charAt(i);
if (c == '<') {
inTag = true;
}
if (!inTag) {
outStr.append(c);
}
if (c == '>') {
inTag = false;
}
}
// Print to show that this method is removing the necessary characters
System.out.println(outStr);
return outStr.toString();
}
I need everything between < and >, including the brackets themselves, to be removed, and the remaining characters should still be printed. For instance:
input:app<html>le
expected:apple
However, it should also remove a lone "<" or ">", and my method isn't doing that:
input:app<le
output:app<le
expected:apple
Please let me know what to fix.
CodePudding user response:
Try parsing HTML using an HTML parser like JSoup or TagSoup.
Once you have the DOM, just call getTextContent() on the root element.
From the API documentation (newer versions of Java behave the same): This attribute returns the text content of this node and its descendants. [...] no serialization is performed, the returned string does not contain any markup.
CodePudding user response:
As the other answer suggests, this works fine with Jsoup:
String input = "app<html>le";
Document doc = Jsoup.parse(input);
System.out.println(doc.wholeText()); // or doc.text()
output:
apple
But the example you gave is not a proper XML document and cannot be processed using XML parsers.
Alternatively, you can modify your program slightly:
public static String stripHtml(String inStr) {
boolean inTag = false;
StringBuffer outStr = new StringBuffer();
int len = inStr.length();
for (int i = 0; i < len; i++) {
char c = inStr.charAt(i);
if (c == '<') {
inTag = true;
} else if (c == '>') {
inTag = false;
} else if (!inTag) {
outStr.append(c);
}
}
return outStr.toString();
}
and
String input = "app<html>le";
System.out.println(stripHtml(input));
output:
apple
CodePudding user response:
Your requirement is to remove a paired <...> and not to treat a lone < as a tag.
This means that your code may only drop the in-tag characters once it encounters the closing >.
Your code could also use int i2 = inStr.indexOf('>', i + 1); to look for a closing > when it reaches a <.
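The indexOf idea could look roughly like the sketch below. It is only one way to write it: the class name StripPairedTags and the loop structure are assumptions for illustration, not code from the question. A < is skipped together with everything up to the next >, and a < that has no closing > is copied through unchanged.

```java
public class StripPairedTags {

    // Removes only complete <...> pairs; a '<' with no later '>' is kept as-is.
    public static String stripHtml(String inStr) {
        StringBuilder outStr = new StringBuilder();
        int i = 0;
        while (i < inStr.length()) {
            char c = inStr.charAt(i);
            if (c == '<') {
                // Look ahead for the closing bracket of this tag.
                int i2 = inStr.indexOf('>', i + 1);
                if (i2 >= 0) {
                    i = i2 + 1; // jump past the whole tag
                    continue;
                }
            }
            outStr.append(c);
            i++;
        }
        return outStr.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("app<html>le")); // prints apple
        System.out.println(stripHtml("app<le"));      // prints app<le
    }
}
```

With this version, "app<le" is left untouched because no closing > follows the <.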
However, a regular-expression replace is simpler:
public static String stripHtml(String s) {
return s.replaceAll("<[^>]*>", "");
}
This matches every occurrence of: a <, then any character that is not >, zero or more times (*), then a >.
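The question also expects a lone < or > to disappear ("app<le" should become "apple"). If that really is the desired behaviour, one possible sketch is to run a second replace that deletes any angle brackets left over after the paired tags are gone. The class name StripTagsAndBrackets is hypothetical, and the second pass is an assumption about the intent rather than part of the answer above.

```java
public class StripTagsAndBrackets {

    public static String stripHtml(String s) {
        return s
                // First pass: remove complete <...> tags.
                .replaceAll("<[^>]*>", "")
                // Second pass: remove any stray '<' or '>' that remains.
                .replaceAll("[<>]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("app<html>le")); // prints apple
        System.out.println(stripHtml("app<le"));      // prints apple
        System.out.println(stripHtml("app>le"));      // prints apple
    }
}
```

Note that this also deletes brackets that were meant literally, such as the < in "1 < 2", so it only suits input where every angle bracket is unwanted.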