How to extract content between CDATA tags in Java

Time:07-02

I managed to get the JSON-LD structured data of a page using Java's Pattern and Matcher, capturing everything between

<script type="application/ld+json">
....
</script> 

The result I received is a String:

//<![CDATA[
{"@type":"...", "@context": "...",..}
//]]>

What I'm interested in is the object between the CDATA markers. I want the string to be like "{\"@type\":\"Product\"}" (with the backslashes added because of the double quotes). How do I extract and modify it? I tried .charAt(idx) to check how the string is structured, but it didn't print out any character.

CodePudding user response:

There are many different ways to do something like this, and there is an SO thread that offers a lot of different ideas. Or you could use something similar to the getBetween() method provided below. It's relatively flexible and works for many different things:

/**
 * Retrieves any string data located between the supplied string leftString
 * parameter and the supplied string rightString parameter.<br><br>
 * <p>
 * This method will return all instances of a substring located between the
 * supplied Left String and the supplied Right String which may be found
 * within the supplied Input String.<br>
 *
 * @param inputString (String) The string to look for substring(s) in.<br>
 *
 * @param leftString  (String) What may be to the Left side of the substring
 *                    we want within the main input string. Sometimes the
 *                    substring you want may be contained at the very
 *                    beginning of a string and therefore there is no
 *                    Left-String available. In this case, pass an empty
 *                    string ("") to this parameter to tell the method so.
 *                    Null cannot be supplied and will cause a
 *                    NullPointerException.<br><br>
 *
 * If the leftString is found to be escaped within the inputString then that
 * escape sequence is converted to a "~:L:~" sequence within the
 * inputString. If this new sequence ("~:L:~") is detected within a found
 * substring then it is automatically converted back to its original escaped
 * sequence before it is added to the returned array.<br>
 *
 * @param rightString (String) What may be to the Right side of the
 *                    substring we want within the main input string.
 *                    Sometimes the substring you want may be contained at
 *                    the very end of a string and therefore there is no
 *                    Right-String available. In this case, pass an empty
 *                    string ("") to this parameter to tell the method so.
 *                    Null cannot be supplied and will cause a
 *                    NullPointerException.<br><br>
 *
 * If the rightString is found to be escaped within the inputString then that
 * escape sequence is converted to a "~:R:~" sequence within the
 * inputString. If this new sequence ("~:R:~") is detected within a found
 * substring then it is automatically converted back to its original escaped
 * sequence before it is added to the returned array.<br>
 *
 * @param options     (Optional - Boolean - 2 Parameters):<pre>
 *
 *      ignoreLetterCase    - Default is false. This option works against the
 *                            string supplied within the leftString parameter
 *                            and the string supplied within the rightString
 *                            parameter. If set to true then letter case is
 *                            ignored when searching for strings supplied in
 *                            these two parameters. If left at default false
 *                            then letter case is not ignored.
 *
 *      trimFound           - Default is true. By default this method will trim
 *                            off leading and trailing white-spaces from found
 *                            sub-string items. General sentences which obviously
 *                            contain spaces will almost always give you a white-
 *                            space within an extracted sub-string. By setting
 *                            this parameter to false, leading and trailing white-
 *                            spaces are not trimmed off before they are placed
 *                            into the returned Array.</pre>
 *
 * @return (String[] Array) Returns a Single Dimensional String Array of all
 *         the sub-strings found within the supplied Input String which are
 *         between the supplied Left-String and supplied Right-String.
 */
public static String[] getBetween(String inputString, String leftString, String rightString, boolean... options) {
    // Return null if nothing was supplied.
    if (inputString.isEmpty() || (leftString.isEmpty() && rightString.isEmpty())) {
        return null;
    }

    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = false;      // Default.
    boolean trimFound = true;        // Default.
    if (options.length >= 1) {
        ignoreCase = options[0];
    }
    if (options.length >= 2) {
        trimFound = options[1];
    }

    // Remove any control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // Establish a List to hold the substrings found between
    // the supplied Left String and supplied Right String.
    // (Fully qualified so that no extra imports are needed.)
    java.util.List<String> list = new java.util.ArrayList<>();

    // Temporarily mask escaped delimiters so they are not
    // mistaken for real ones during matching.
    if (modString.contains("\\" + leftString)) {
        modString = modString.replace("\\" + leftString, "~:L:~");
    }
    if (modString.contains("\\" + rightString)) {
        modString = modString.replace("\\" + rightString, "~:R:~");
    }

    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    String regEx = java.util.regex.Pattern.quote(leftString) + "{1,}"
            + (!rightString.isEmpty() ? "(.*?)" : "(.*)?")
            + java.util.regex.Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }

    java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(regEx);
    java.util.regex.Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        // Restore any escaped delimiters that were masked earlier.
        found = found.replace("~:L:~", "\\" + leftString).replace("~:R:~", "\\" + rightString);
        list.add(found);
    }
    return list.toArray(new String[list.size()]);
}

How this method might be used:

String strg = "//<![CDATA[\n"
            + "{\"@type\":\"...\", \"@context\": \"...\",..}\n"
            + "//]]>";

String[] data = getBetween(strg, "cdata[", "//]]", true, true);
for (String str : data) {
    System.out.println(str);
}

The Console Window will display:

{"@type":"...", "@context": "...",..}
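If the full flexibility of getBetween() isn't needed, the same extraction can be sketched with plain indexOf() and substring(). This is only a minimal sketch under some assumptions: the class and method names (CdataSubstring, extract, escapeQuotes) are made up for illustration, the input is assumed to contain a single CDATA section, and the JSON itself is assumed not to contain "]]>". The escapeQuotes() helper covers the backslash part of the question, though it is probably only needed when the text gets embedded inside another quoted string; the backslashes in a Java source literal are not part of the string at run time.

```java
class CdataSubstring {

    // Markers as they appear in the string from the question.
    private static final String OPEN = "<![CDATA[";
    private static final String CLOSE = "]]>";

    /** Returns the text between the CDATA markers, or null if a marker is missing. */
    static String extract(String raw) {
        int start = raw.indexOf(OPEN);
        if (start < 0) {
            return null;
        }
        start += OPEN.length();
        int end = raw.indexOf(CLOSE, start);
        if (end < 0) {
            return null;
        }
        String body = raw.substring(start, end).trim();
        // The closing marker is commented out as "//]]>", so drop the trailing "//".
        if (body.endsWith("//")) {
            body = body.substring(0, body.length() - 2).trim();
        }
        return body;
    }

    /** Escapes backslashes first, then double quotes, with a leading backslash. */
    static String escapeQuotes(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        String raw = "//<![CDATA[\n{\"@type\":\"Product\"}\n//]]>";
        String json = extract(raw);
        System.out.println(json);               // the bare JSON text
        System.out.println(escapeQuotes(json)); // same text with \" in place of "
    }
}
```

For anything beyond extraction (reading fields, validating the structure), handing the extracted text to a JSON parser is likely a better fit than more string manipulation.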

CodePudding user response:

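Use a regex. The (?s) flag lets . match the line breaks around the JSON, and trim() removes them from the result:

String cdata = str.replaceAll("(?s).*//<!\\[CDATA\\[(.*)//]]>.*", "$1").trim();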
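A variant of the same idea with a precompiled pattern and an explicit "no match" result might look like the sketch below. DOTALL mode matters here because the JSON is surrounded by line breaks, which . does not match by default. The class and method names (CdataRegex, extract) are hypothetical, and the pattern assumes the markers look like the ones in the question, with an optional "//" in front of the closing "]]>".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CdataRegex {

    // DOTALL lets '.' match line terminators; the lazy group stops at the
    // first closing marker, and the optional "//" before it is left out.
    private static final Pattern CDATA =
            Pattern.compile("<!\\[CDATA\\[(.*?)(?://)?\\]\\]>", Pattern.DOTALL);

    /** Returns the trimmed CDATA content, or an empty Optional if there is none. */
    static Optional<String> extract(String raw) {
        Matcher m = CDATA.matcher(raw);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String raw = "//<![CDATA[\n{\"@type\":\"Product\"}\n//]]>";
        System.out.println(extract(raw).orElse("(no CDATA section found)"));
    }
}
```

Unlike replaceAll(), which returns the input unchanged when the pattern does not match, this version makes a missing CDATA section visible to the caller.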