I have a simple function to convert html blob to plain text
FUNCTION HTML_TO_TEXT(html IN CLOB) RETURN CLOB
IS v_return CLOB;
BEGIN
select utl_i18n.unescape_reference(regexp_replace(html, '<. ?>', ' ')) INTO v_return from dual;
return (v_return);
END;
called in that way:
SELECT A, B, C, HTML_TO_TEXT(BLobField) FROM t1
all works fine until BlobFields contains more than 4000 character, then i got
ORA-01704: string literal too long
01704. 00000 - "string literal too long"
*Cause: The string literal is longer than 4000 characters.
*Action: Use a string literal of at most 4000 characters.
Longer values may only be entered using bind variables.
i try to avoud string inside function using variables but nothing changes:
FUNCTION HTML_TO_TEXT(html IN CLOB) RETURN CLOB
IS v_return CLOB;
"stringa" CLOB;
BEGIN
SELECT regexp_replace(html, '<. ?>', ' ') INTO "stringa" FROM DUAL;
select utl_i18n.unescape_reference("stringa") INTO v_return from dual;
return (v_return);
END;
CodePudding user response:
Do not use regular expressions to parse HTML. If you want to extract the text then use an XML parser:
SELECT a,
b,
c,
UTL_I18N.UNESCAPE_REFERENCE(
XMLQUERY(
'//text()'
PASSING XMLTYPE(blobfield, 1)
RETURNING CONTENT
).getStringVal()
) AS text
FROM t1
Which will work where the extracted text is 4000 characters or less (since XMLTYPE.getStringVal()
will return a VARCHAR2
data type and UTL_I18N.UNESCAPE_REFERENCE
accepts a VARCHAR2
argument).
If you want to get it to work on CLOB
values then you can still use XMLQUERY
and getClobVal()
but UTL_I18N.UNESCAPE_REFERENCE
still only works on VARCHAR2
input (and not CLOB
s) so you will need to split the CLOB
into segments and parse those and concatenate them once you are done.
Something like:
CREATE FUNCTION html_to_text(
i_xml IN XMLTYPE
) RETURN CLOB
IS
v_text CLOB;
v_output CLOB;
str VARCHAR2(4000);
len PLS_INTEGER;
pos PLS_INTEGER := 1;
lim CONSTANT PLS_INTEGER := 4000;
BEGIN
SELECT XMLQUERY(
'//text()'
PASSING i_xml
RETURNING CONTENT
).getStringVal()
INTO v_text
FROM DUAL;
len := LENGTH(v_text);
WHILE pos <= len LOOP
str := DBMS_LOB.SUBSTR(v_text, lim, pos);
v_output := v_output || UTL_I18N.UNESCAPE_REFERENCE(str);
pos := pos lim;
END LOOP;
RETURN v_output;
END;
/
However, you probably want to make it more robust and check if you are going to split the string in the middle of an escaped XML character.
db<>fiddle here