How can I use a regex to remove HTML tags from a String?

Time:07-08

I'm trying to use String.replaceAll(String regex, String replacement) to strip the markup from an HTML document. My aim is to remove every <>-bracket pair together with its contents, using an empty String ("") as the replacement. For example, this:

<tr class='list odd'>
<td  align="center">Do</td>
<td  align="center">7.7.</td><td  align="center">3 - 4</td>
<td  align="center">---</td>
<td  align="center"><s>Q1e14</s></td>
<td  align="center">Arbeitsauftrag:</td>
<td  align="center">entfällt</td></tr>

Should turn into this:

Do
7.7.
3 - 4
---
Q1e14
Arbeitsauftrag
entfällt

I'm completely new to regex and after watching some tutorials I came up with these regexes:

\u003C([a-zA-Z0-9]|\s|\S)+
[\u003C]([a-zA-Z0-9]|\s|\W)+\u003E

I built them using https://regexr.com. However, while they at least seem to work there, both result in a StackOverflowError in my code.

(Note that my IDE, IntelliJ, automatically makes each backslash into two backslashes. I think this is just adjusting the JavaScript regex to Java, but I could be wrong.)
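
The backslash doubling is Java string-literal escaping rather than a change to the regex itself: the compiler consumes one level of backslashes, so the regex engine still receives a single backslash. A minimal sketch, assuming a hypothetical class `BackslashSketch` and helper `collapseWhitespace` that are not part of the original code:

```java
public class BackslashSketch {
    // The regex \s+ written as a Java string literal: the backslash is doubled
    // because the Java compiler consumes one level of escaping before the
    // regex engine ever sees the pattern.
    static final String WHITESPACE_RUN = "\\s+";

    /** Replaces each run of whitespace with a single underscore (hypothetical helper). */
    static String collapseWhitespace(String input) {
        return input.replaceAll(WHITESPACE_RUN, "_");
    }

    public static void main(String[] args) {
        System.out.println(WHITESPACE_RUN);                // prints \s+
        System.out.println(WHITESPACE_RUN.length());       // prints 3
        System.out.println(collapseWhitespace("a  b\tc")); // prints a_b_c
    }
}
```

The string holds only three characters, so the pattern compiled by Java is the same one that was typed on the regex website.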

TL;DR: How can I remove HTML tags (the <>-brackets and everything inside them) by replacing them with an empty String, using replaceAll or an alternative if there is one?

CodePudding user response:

Use a proper HTML parser such as Jsoup instead of string manipulation or regex. Jsoup provides a convenient, intuitive API for extracting and manipulating HTML data. With Jsoup, your code could look like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Example2 {
    public static void main(String[] args) {
        // The string pieces must be joined with '+', otherwise this does not compile.
        String html =
                  "<html>\n"
                + "<head></head>"
                + "<body>"
                + "  <table>"
                + "     <tr class='list odd'>\n"
                + "        <td class=\"list\" align=\"center\">Do</td>\n"
                + "        <td class=\"list\" align=\"center\">7.7.</td><td class=\"list\" align=\"center\">3 - 4</td>\n"
                + "        <td class=\"list\" align=\"center\">---</td>\n"
                + "        <td class=\"list\" align=\"center\"><s>Q1e14</s></td>\n"
                + "        <td class=\"list\" align=\"center\">Arbeitsauftrag:</td>\n"
                + "        <td class=\"list\" align=\"center\">entfällt</td></tr>\n"
                + "   </table>"
                + "</body>\n"
                + "</html>";

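        // Parse the markup into a DOM tree.
        Document doc = Jsoup.parse(html);

        // Select every <td> element and print its text content, nested tags stripped.
        Elements tds = doc.select("td");
        tds.forEach(td -> System.out.println(td.text()));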
    }
}

Output:

Do
7.7.
3 - 4
---
Q1e14
Arbeitsauftrag:
entfällt
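
If a regex is still preferred for a small, well-formed snippet like the one in the question, the following is a minimal sketch of that alternative, not a replacement for the parser. It assumes that no `>` character ever appears inside an attribute value, and that tags should become line breaks rather than an empty String (with an empty String, adjacent cells such as `7.7.` and `3 - 4` would run together). The class name `StripTagsSketch` and the method `extractText` are hypothetical. The StackOverflowError most likely comes from repeating a group that contains an alternation, which Java's regex engine matches recursively; a negated character class avoids that shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class StripTagsSketch {
    // Matches '<', then any run of characters other than '>', then '>'.
    // A negated character class is matched without per-character recursion,
    // unlike a repeated group with alternation such as (a|b)+.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Replaces every tag with a line break and returns the non-blank lines, trimmed. */
    static List<String> extractText(String html) {
        String withoutTags = TAG.matcher(html).replaceAll("\n");
        List<String> lines = new ArrayList<>();
        for (String line : withoutTags.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Shortened version of the table row from the question.
        String html = "<tr class='list odd'>\n"
                + "<td  align=\"center\">Do</td>\n"
                + "<td  align=\"center\">7.7.</td><td  align=\"center\">3 - 4</td>\n"
                + "<td  align=\"center\"><s>Q1e14</s></td>\n"
                + "<td  align=\"center\">Arbeitsauftrag:</td></tr>";
        // Prints each cell's text on its own line.
        extractText(html).forEach(System.out::println);
    }
}
```

This approach breaks on comments, scripts, and attribute values containing `>`, which is why the parser-based version above remains the safer choice for real documents.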

Maven dependency:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.2</version>
</dependency>