Home > Enterprise >  Traversing nested HTML lists recursively
Traversing nested HTML lists recursively

Time:06-13

I am unable to take the HTML below and put it into a list like:

List<String> output = Arrays.asList(
    new String[] {
        "First Level-Second Level--Third Level",
        "a-b--c1",
        "a-b--c2",
        "a-b--c3",
        "a-b2--c1",
        "a-b2--c2",
        "a-b2--c3"
    });

<ul>
    <li>First Level</li>
    <ul>
        <li>Second Level</li>
        <ul>
            <li>Third Level</li>
        </ul>
    </ul>
    <li>a</li>
    <ul>
        <li>b</li>
        <ul>
            <li>c</li>
            <li>c</li>
            <li>c</li>
        </ul>
        <li>b2</li>
        <ul>
            <li>c1</li>
            <li>c2</li>
            <li>c3</li>
        </ul>
    </ul>
</ul>

I have loaded it into Jsoup to go through them like elements. The code I have is below:

public static String recHTML(Element element, String str) { 
    str  = "-"   element.ownText(); 
    if (element.children().size() == 0) return str   "--"   element.ownText(); 
    else { 
        for (int i = 0; i < element.children().size(); i  ) { 
            Element next = element.child(i); 
            str  = recHTML(next, str); 
        }
    }
}

I am returning the string with the dashes to use regex to split them into an array to have different levels of indentation. I am struggling to get my output to match up and no matter what I try I just can't get this to work. Any help would be appreciated, thank you.

CodePudding user response:

Suppose you create a program stub like bellow:


import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class App {
    private static List<String> output = new ArrayList<String>();
    public static void main(String[] args) {
        Element input = Jsoup.parse(/*your HTML list*/).selectFirst("body").child(0);
        recHTML(input, "", "");
        System.out.println(output);
    }

    public static void recHTML(Element element, String chain, String link) {
        // Todo! 
    }
}

Now let's consider your invalid HTML example and its valid equivalent

Invalid HTML input list

<ul>
    <li>First Level</li>
    <ul>
        <li>Second Level</li>
        <ul>
            <li>Third Level</li>
        </ul>
    </ul>
    <li>a</li>
    <ul>
        <li>b</li>
        <ul>
            <li>c</li>
            <li>c</li>
            <li>c</li>
        </ul>
        <li>b2</li>
        <ul>
            <li>c1</li>
            <li>c2</li>
            <li>c3</li>
        </ul>
    </ul>
</ul>

Currently, you're trying to implement something like a depth-first search (DFS). A typical DFS won't do. Why? Because the HTML structure you provided has multiple leaf nodes that do not terminate a chain.

Suppose we call each element of your output a chain. As you traverse the HTML tree in a DFS manner, note that a chain is complete as soon as you find a li element without ul siblings. This should be your end condition. As for the recursion step, think about how to handle ul and li elements. It is not easy to define; currently, my solution only works for your specific input.

Your recHTML implementation for this is bound to be ugly since there's no direct nested chain from the root to leaf nodes. New edge cases will continuously come up where your code will not work and you'll keep making edits.

public static void recHTML(Element element, String chain, String link) {
    if (element.parent().children().select("ul").isEmpty()) {
        output.add(chain);
    }

    for (int i = 0; i < element.children().size(); i  ) {
        Element next = element.child(i);
        if (next.tagName().equals("ul")) {
            recHTML(next, chain    link   element.child(i - 1).ownText(), link   '-');
        } else {
            recHTML(next, chain    link   next.ownText(), link);
        }
    }
}

Valid HTML input list

<ul>
  <li>
    First level
    <ul>
      <li>
        Second Level
        <ul>
          <li>Third Level</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>
    a
    <ul>
      <li>
        b
        <ul>
          <li>c</li>
          <li>c</li>
          <li>c</li>
        </ul>
      </li>
      <li>
        b2
        <ul>
          <li>c1</li>
          <li>c2</li>
          <li>c3</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

As suggested in my comments, your problem lies in the input. Suppose we somehow transform your input HTML to the above markup. Then the solution becomes a lot cleaner and more straightforward - a simple DFS will work!

public static void recHTML(Element element, String chain, String link) { 
    if (element.children().isEmpty()) {
        output.add(chain);
    }

    for (Element child : element.children()) {
        if (child.tagName().equals("li")) {
            recHTML(child, chain   link   child.ownText(), link   "-");
        } else {
            recHTML(child, chain, link);
        }
    }
}

Fun fact: Java HTML tidy tool JTidy has corrected your HTML input in a slightly different way than I have:

<!-- original snippet -->
<li>First Level</li>
<ul>
  <li>Second Level</li>
  <!-- omitted elements for brevity -->
</ul>

<!-- snippet, corrected by jtidy -->
<li>First Level</li>
<li style="list-style: none">
  <ul>
    <li>Second Level</li>
    <!-- omitted elements for brevity -->
  </ul>
</li>

Personally, I don't think it corrects it the right way - Proper way to make HTML nested list?

  • Related