I have a task to write a Maven plugin that takes HTML files in a certain location and adds a service attribute to each tag that doesn't have it. This is done on the source code, which means my colleagues and I will have to edit those files further.
As a first solution I turned to Jsoup, which seems to do the job but has one small yet annoying problem: if a tag has multiple long attributes (ours often do, as this HTML code is a source for further processing), we wrap the lines like this:
<ui:grid id="category_search" title="${handler.getMessage( 'title' )}"
filterListener="onApplyFilter" paginationListener="onPagination" ds="${handler.ds}"
filterFragment="grid_filter" contentFragment="grid_contents"/>
However, Jsoup turns this into one very long line:
<ui:grid id="category_search" title="${handler.getMessage( 'title' )}" filterListener="onApplyFilter" paginationListener="onPagination" ds="${handler.ds}" filterFragment="grid_filter" contentFragment="grid_contents"/>
This is bad practice and a real pain to read and edit.
So is there any other, not too convoluted way to add this attribute without parsing and recomposing the HTML code, or some way to preserve line breaks inside the tag?
CodePudding user response:
Unfortunately, Jsoup's main use case is not to create HTML that is read or edited by humans. Specifically, Jsoup's API is closely modeled after the DOM, which has no way to store or model line breaks inside tags, so it cannot preserve them.
I can think of only two solutions:
1. Find (or write) an alternative HTML parser library with an API that preserves formatting inside tags. I'd be surprised if such a thing already exists.
2. Run the generated code through a formatter that supports wrapping inside tags. This won't preserve the original line breaks, but at least the attributes won't all be on one line. I wasn't able to find a Java library that does that, so you may need to consider using an external program.
CodePudding user response:
It seems there is no good way to preserve line breaks inside tags while parsing them into POJOs (or I haven't found one), so I wrote a simple tokenizer that splits the incoming HTML string into parts, roughly like this:
String[] parts = html.split( "((?=<)|(?<=>))" );
This uses regex lookarounds (a lookahead and a lookbehind) to split before < and after > without consuming any characters. Then just iterate over the parts and decide whether to insert the attribute or not.
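For anyone who wants to see the whole loop, here is a minimal sketch of how the iteration could look. It is not the exact code from my plugin: the class name ServiceAttributeAdder, the method addAttribute and the attribute name and value used in main are placeholders I made up for illustration. It assumes that attribute values never contain a raw < or > (otherwise the split cuts a tag in the middle), and it inserts the new attribute right after the tag name so that every existing space and line break in the tag is left untouched.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class ServiceAttributeAdder {

    // Zero-width split points: before every '<' and after every '>'.
    private static final Pattern TAG_BOUNDARY = Pattern.compile("(?=<)|(?<=>)");

    static String addAttribute(String html, String name, String value) {
        // Attribute counts as present if its name follows whitespace and precedes '='.
        // Assumption: the same text does not occur inside another attribute's value.
        Pattern alreadyPresent = Pattern.compile("\\s" + Pattern.quote(name) + "\\s*=");
        StringBuilder out = new StringBuilder(html.length() + 64);
        for (String part : TAG_BOUNDARY.split(html)) {
            if (isOpeningTag(part) && !alreadyPresent.matcher(part).find()) {
                int nameEnd = endOfTagName(part);
                // Copy "<tagname", add the attribute, then copy the rest verbatim.
                out.append(part, 0, nameEnd)
                   .append(' ').append(name).append("=\"").append(value).append('"')
                   .append(part, nameEnd, part.length());
            } else {
                // Text, closing tags, comments and doctypes pass through unchanged.
                out.append(part);
            }
        }
        return out.toString();
    }

    // An opening (or self-closing) tag looks like "<x...>" where x is a letter,
    // which excludes "</...>", "<!...>" and "<?...>".
    private static boolean isOpeningTag(String part) {
        return part.length() > 2
                && part.charAt(0) == '<'
                && part.charAt(part.length() - 1) == '>'
                && Character.isLetter(part.charAt(1));
    }

    // Index just past the tag name: the first whitespace, '/' or '>' after '<'.
    private static int endOfTagName(String part) {
        int i = 1;
        while (i < part.length()) {
            char c = part.charAt(i);
            if (Character.isWhitespace(c) || c == '/' || c == '>') {
                break;
            }
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String html = "<ui:grid id=\"category_search\"\n"
                + "    filterFragment=\"grid_filter\" contentFragment=\"grid_contents\"/>";
        // "service" and "generated" are placeholder attribute name and value.
        System.out.println(addAttribute(html, "service", "generated"));
    }
}
```

Because nothing is parsed into a tree and written back, the wrapped attributes stay on the lines where we put them. The price is that this is not a real HTML parser: scripts, comments containing tags, or a > inside an attribute value would need extra handling.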