I managed to get the JSON-LD structured data of a page using Java's Pattern and Matcher, capturing everything between
<script type="application/ld+json">
....
</script>
The result I received is a String:
//<![CDATA[
{"@type":"...", "@context": "...",..}
//]]>
What I'm interested in is the object inside the CDATA wrapper. I want the string to look like "{\"@type\":\"Product\"}"
(with the backslashes added because of the double quotes). How do I extract and modify it? I tried .charAt(idx) to check how the string is structured, but it didn't print any visible character.
CodePudding user response:
There are many ways to do something like this, and there is an SO thread that offers a lot of different ideas. Alternatively, you could use something similar to the getBetween() method provided below. It's flexible enough to handle many different cases:
/**
* Retrieves any string data located between the supplied string leftString
* parameter and the supplied string rightString parameter.<br><br>
* <p>
* This method will return all instances of a substring located between the
* supplied Left String and the supplied Right String which may be found
* within the supplied Input String.<br>
*
* @param inputString (String) The string to look for substring(s) in.<br>
*
* @param leftString (String) What may be to the Left side of the substring
* we want within the main input string. Sometimes the
* substring you want may be contained at the very
* beginning of a string and therefore there is no
* Left-String available. In this case you would simply
* pass an empty String ("") to this parameter which
* basically informs the method of this fact. Null can
* not be supplied and will ultimately generate a
* NullPointerException.<br><br>
*
* If the leftString is found to be escaped within the inputString then that
* escape sequence is converted to a "~:L:~" sequence within the
* inputString. If this new sequence ("~:L:~") is detected within a found
* substring then it is automatically converted back to its original escaped
* sequence before it is added to the returned array.<br>
*
* @param rightString (String) What may be to the Right side of the
* substring we want within the main input string.
* Sometimes the substring you want may be contained at
* the very end of a string and therefore there is no
* Right-String available. In this case you would simply
* pass an empty String ("") to this parameter which
* basically informs the method of this fact. Null can
* not be supplied and will ultimately generate a
* NullPointerException.<br><br>
*
* If the rightString is found to be escaped within the inputString then that
* escape sequence is converted to a "~:R:~" sequence within the
* inputString. If this new sequence ("~:R:~") is detected within a found
* substring then it is automatically converted back to its original escaped
* sequence before it is added to the returned array.<br>
*
* @param options (Optional - Boolean - 2 Parameters):<pre>
*
* ignoreLetterCase - Default is false. This option works against the
* string supplied within the leftString parameter
* and the string supplied within the rightString
* parameter. If set to true then letter case is
* ignored when searching for strings supplied in
* these two parameters. If left at default false
* then letter case is not ignored.
*
* trimFound - Default is true. By default this method will trim
* off leading and trailing white-spaces from found
* sub-string items. General sentences which obviously
* contain spaces will almost always give you a white-
* space within an extracted sub-string. By setting
* this parameter to false, leading and trailing white-
* spaces are not trimmed off before they are placed
* into the returned Array.</pre>
*
* @return (String[] Array) Returns a Single Dimensional String Array of all
* the sub-strings found within the supplied Input String which are
* between the supplied Left-String and supplied Right-String.
*/
public static String[] getBetween(String inputString, String leftString, String rightString, boolean... options) {
    // Return null if nothing was supplied.
    if (inputString.isEmpty() || (leftString.isEmpty() && rightString.isEmpty())) {
        return null;
    }
    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = false; // Default.
    boolean trimFound = true;   // Default.
    if (options.length >= 1) {
        ignoreCase = options[0];
    }
    if (options.length >= 2) {
        trimFound = options[1];
    }
    // Remove any control characters (line breaks included)
    // from the supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");
    // Establish a List to hold our found substrings between
    // the supplied Left String and supplied Right String.
    java.util.List<String> list = new java.util.ArrayList<>();
    // Temporarily mask escaped delimiters so they are not matched.
    if (modString.contains("\\" + leftString)) {
        modString = modString.replace("\\" + leftString, "~:L:~");
    }
    if (modString.contains("\\" + rightString)) {
        modString = modString.replace("\\" + rightString, "~:R:~");
    }
    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    String regEx = java.util.regex.Pattern.quote(leftString) + "{1,}"
            + (!rightString.isEmpty() ? "(.*?)" : "(.*)?")
            + java.util.regex.Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }
    java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(regEx);
    java.util.regex.Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        // Restore any masked (escaped) delimiters.
        found = found.replace("~:L:~", "\\" + leftString).replace("~:R:~", "\\" + rightString);
        list.add(found);
    }
    return list.toArray(new String[list.size()]);
}
How this method might be used:
String strg = "//<![CDATA[\n"
        + "{\"@type\":\"...\", \"@context\": \"...\",..}\n"
        + "//]]>";
// ignoreLetterCase = true so that "cdata[" also matches "CDATA[".
String[] data = getBetween(strg, "cdata[", "//]]", true, true);
for (String str : data) {
    System.out.println(str);
}
The Console Window will display:
{"@type":"...", "@context": "...",..}
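On the backslash part of the question: the extracted String already holds plain double quotes in memory, and the backslashes are only needed when the text is written as a Java (or JSON) string literal. If you really do need the escaped form as text, a minimal sketch could look like the following. The class and method names are made up for illustration, and it only handles backslashes and double quotes, not control characters.

```java
final class JsonLiteralEscaper {

    // Hypothetical helper (not part of getBetween): escapes backslashes
    // first, then double quotes, so that backslashes added for the quotes
    // are not escaped a second time.
    static String escapeForLiteral(String raw) {
        return raw.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        String json = "{\"@type\":\"Product\"}";
        System.out.println(json);                   // {"@type":"Product"}
        System.out.println(escapeForLiteral(json)); // {\"@type\":\"Product\"}
    }
}
```

If the goal is just to hand the JSON to a parser, the unescaped String is usually what you want.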
CodePudding user response:
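Use a regex. Note that . does not match line breaks by default, so the (?s) flag is needed for the multi-line input shown in the question, and trim() removes the line breaks left around the JSON:
cdata = str.replaceAll("(?s).*//<!\\[CDATA\\[(.*)//]]>.*", "$1").trim();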
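One thing to keep in mind with replaceAll is that it returns the input unchanged when the pattern does not match, so a missing CDATA wrapper goes unnoticed. A rough sketch of the same idea with an explicit Matcher is below. The CdataExtractor class and its extract method are illustrative names, and it assumes the wrapper looks exactly like the one in the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CdataExtractor {

    // DOTALL lets '.' match line breaks; the reluctant (.*?) stops at the
    // first "//]]>" that follows the opening "//<![CDATA[".
    private static final Pattern CDATA =
            Pattern.compile("//<!\\[CDATA\\[(.*?)//\\]\\]>", Pattern.DOTALL);

    // Returns the trimmed text inside the wrapper, or empty if no wrapper is found.
    static Optional<String> extract(String scriptBody) {
        Matcher m = CDATA.matcher(scriptBody);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String body = "//<![CDATA[\n{\"@type\":\"Product\"}\n//]]>";
        System.out.println(extract(body).orElse("no CDATA wrapper found"));
    }
}
```

Returning an Optional makes the "no wrapper" case visible to the caller, which may matter if some pages embed the JSON-LD without the CDATA comment lines.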