Home > Net >  Oracle remove html from clob fields
Oracle remove html from clob fields

Time:06-22

I have a simple function to convert html blob to plain text

 FUNCTION HTML_TO_TEXT(html IN CLOB) RETURN CLOB
 IS v_return CLOB;
   BEGIN     
select utl_i18n.unescape_reference(regexp_replace(html, '<. ?>', ' ')) INTO v_return from dual;
return (v_return);
    END;

called in that way:

SELECT A, B, C, HTML_TO_TEXT(BLobField) FROM t1

all works fine until BlobFields contains more than 4000 character, then i got

ORA-01704: string literal too long
01704. 00000 -  "string literal too long"
*Cause:    The string literal is longer than 4000 characters.
*Action:   Use a string literal of at most 4000 characters.
           Longer values may only be entered using bind variables.

i try to avoud string inside function using variables but nothing changes:

FUNCTION HTML_TO_TEXT(html IN CLOB) RETURN CLOB
 IS v_return CLOB;
 "stringa" CLOB;
   BEGIN 
    SELECT regexp_replace(html, '<. ?>', ' ') INTO "stringa" FROM DUAL;
    select utl_i18n.unescape_reference("stringa") INTO v_return from dual;
    return (v_return);
END;

CodePudding user response:

Do not use regular expressions to parse HTML. If you want to extract the text then use an XML parser:

SELECT a,
       b,
       c,
       UTL_I18N.UNESCAPE_REFERENCE(
         XMLQUERY(
           '//text()'
           PASSING XMLTYPE(blobfield, 1)
           RETURNING CONTENT
         ).getStringVal()
       ) AS text
FROM   t1

Which will work where the extracted text is 4000 characters or less (since XMLTYPE.getStringVal() will return a VARCHAR2 data type and UTL_I18N.UNESCAPE_REFERENCE accepts a VARCHAR2 argument).


If you want to get it to work on CLOB values then you can still use XMLQUERY and getClobVal() but UTL_I18N.UNESCAPE_REFERENCE still only works on VARCHAR2 input (and not CLOBs) so you will need to split the CLOB into segments and parse those and concatenate them once you are done.

Something like:

CREATE FUNCTION html_to_text(
  i_xml IN XMLTYPE
) RETURN CLOB
IS
  v_text   CLOB;
  v_output CLOB;
  str      VARCHAR2(4000);
  len      PLS_INTEGER;
  pos      PLS_INTEGER := 1;
  lim      CONSTANT PLS_INTEGER := 4000;
BEGIN
  SELECT XMLQUERY(
           '//text()'
           PASSING i_xml
           RETURNING CONTENT
         ).getStringVal()
  INTO   v_text
  FROM   DUAL;
  
  len := LENGTH(v_text);
  WHILE pos <= len LOOP
    str := DBMS_LOB.SUBSTR(v_text, lim, pos);
    v_output := v_output || UTL_I18N.UNESCAPE_REFERENCE(str);
    pos := pos   lim;
  END LOOP;
  
  RETURN v_output;
END;
/

However, you probably want to make it more robust and check if you are going to split the string in the middle of an escaped XML character.

db<>fiddle here

  • Related