JSoup here. I have the following HTML I'm trying to parse:
<html><head>
<title>My Soup Materials</title>
<!--mstheme--><link rel="stylesheet" type="text/css" href="../../_themes/ice/ice1011.css"><meta name="Microsoft Theme" content="ice 1011, default">
</head>
<body><center><table width="92%"><tbody>
<tr>
<td><h2>My Soup Materials</h2>
<table width="100%%" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td align="left"><b>Origin:</b> Belgium</td>
<td align="left"><b>Count:</b> 2 foos</td>
</tr>
<tr>
<td align="left"><b>Supplier:</b> </td>
<td align="left"><b>Must Burninate:</b> Yes</td>
</tr>
<tr>
<td align="left"><b>Type:</b> Fizzbuzz</td>
<td align="left"><b>Add Afterwards:</b> No</td>
</tr>
</tbody>
</table>
<br>
<b><u>Notes</b></u><br>Drink more ovaltine</td>
</tr>
</tbody>
</table>
</center></body>
</html>
Unfortunately it's actually slightly malformed (missing some closing tags, the closing tags for <b> and <u> are out of order, etc.), but I'm hoping JSoup can handle that. I don't have control over the HTML.
I have the following Java model/POJO:
@Data // lombok; adds ctors, getters, setters, etc.
public class Material {
    private String name;
    private String origin;
    private String count;
    private String supplier;
    private Boolean burninate;
    private String type;
    private Boolean addAfterwards;
}
I am trying to get JSoup to parse this HTML and provide a Material instance from that parsing.
To grab the data inside the <table>, I'm pretty close:
Material material = new Material();
Elements rows = document.select("table").select("tr");
for (Element row : rows) {
    // row 1: origin & count
    Elements cols = row.select("td");
    for (Element col : cols) {
        material.setOrigin(???);
        material.setCount(???);
    }
}
So I'm able to get each <tr>, and for each <tr> get all of its <td> cols. But where I'm hung up is:
<td align="left"><b>Origin:</b> Belgium</td>
So the col.text() for the first <td> would be <b>Origin:</b> Belgium. How do I tell JSoup that I only want the "Belgium"?
CodePudding user response:
I think you're looking for tdNode.ownText(). There's also simply text(), but as the docs state, this combines the text of the node and all its children and normalizes it. In other words, tdNode.text() gives you the string "Origin: Belgium". tdNode.ownText() gives you just "Belgium", and tdNode.child(0).ownText() gets you just "Origin:".
You can also use wholeText(), which is non-normalized, but I think you want the normalization here (that primarily involves collapsing and trimming whitespace).
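To make the ownText() approach concrete, here is a minimal sketch that walks every <td> with a direct <b> child and pairs the label with the cell's own text. The class name MaterialFields and the helper extract are made up for illustration, and the sample HTML is a trimmed copy of the inner table from the question. It assumes each cell follows the <b>Label:</b> value pattern; it returns a plain map rather than filling the Material POJO, so mapping "Yes"/"No" to Boolean is left to the caller.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class MaterialFields {

    // Hypothetical helper: collects label -> value pairs from cells
    // shaped like <td><b>Label:</b> value</td>.
    static Map<String, String> extract(String html) {
        Document document = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        for (Element label : document.select("td > b")) {
            Element cell = label.parent();
            // The label is the text of the <b>, minus the trailing colon.
            String key = label.text().replace(":", "").trim();
            // ownText() skips the <b> child and keeps only the cell's own text.
            fields.put(key, cell.ownText().trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<table><tbody><tr>"
                + "<td align=\"left\"><b>Origin:</b> Belgium</td>"
                + "<td align=\"left\"><b>Count:</b> 2 foos</td>"
                + "</tr><tr>"
                + "<td align=\"left\"><b>Supplier:</b> </td>"
                + "<td align=\"left\"><b>Must Burninate:</b> Yes</td>"
                + "</tr></tbody></table>";

        Map<String, String> fields = extract(html);
        System.out.println(fields.get("Origin"));          // Belgium
        System.out.println(fields.get("Count"));           // 2 foos
        System.out.println(fields.get("Supplier").isEmpty()); // true
        // One possible Yes/No mapping for the Boolean fields.
        boolean burninate = "Yes".equalsIgnoreCase(fields.get("Must Burninate"));
        System.out.println(burninate);                     // true
    }
}
```

From there, something like material.setOrigin(fields.get("Origin")) would fill in the POJO. Looking values up by label rather than by row and column position also means a reordered table should not silently put the wrong value in the wrong field.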