How to create a word cloud in Oracle through column string-CodePudding

I'm trying to find some article that helps me in the work I have to do. I have a table with the questions and answers of a customer service on a website and I would like to create a correlation between cause and effect between them, regardless of the line in which the complaint was made and the question was answered, the idea would be to have a histogram that you give me a direction which are the most used words by the customers in the complaints that are in the STR_COMPLAINTS column and the most used words in the treatment of the STR_ANSWERS in the answers column. As follows example:

ID	STR_COMPLAINTS	STR_ANSWERS
1	My pizza came cold	We will create a temperature control before serving the pizza
:--:	:-------------------	:--------------
2	My Burger came cold	We will create a temperature control before serving the Burger

Expected outcome:

WORD	QUESTION	ANSWERS
My	2	0
pizza	2	2
came	2	0
cold	2	0
We	0	2
will	0	2
create	0	2
a	0	2
temperature	0	2
control	0	2
before	0	2
serving	0	2
the	0	2
Burger	2	2

CodePudding user response：

You can token each complain and each answer to get all the individual words, from that get all the unique words, and then count how many times each word appears.

This approach uses regular expressions to tokenize the strings, which can be slow and expensive, but you can modify that part if necessary.

Each column is handled in its own recursive CTE, and then a third CTE gets the unique words. That third CTE, with the list of words, is then outer joined to the first two, and the matching IDs are counted. (You might want to count distinct IDs if you don't want to count repeated words multiple times.)

with q (id, str_complaints, pos, word) as (
  select id, str_complaints, 1, regexp_substr(str_complaints, '(\w )', 1, 1, null, 1)
  from complaints
  union all
  select id, str_complaints, pos   1, regexp_substr(str_complaints, '(\w )', 1, pos   1, null, 1)
  from q
  where regexp_substr(str_complaints, '(\w )', 1, pos   1, null, 1) is not null
),
a (id, str_answers, pos, word) as (
  select id, str_answers, 1, regexp_substr(str_answers, '(\w )', 1, 1, null, 1)
  from complaints
  union all
  select id, str_answers, pos   1, regexp_substr(str_answers, '(\w )', 1, pos   1, null, 1)
  from a
  where regexp_substr(str_answers, '(\w )', 1, pos   1, null, 1) is not null
),
w (word) as (
  select word
  from q
  union
  select word
  from a
)
select w.word, count(q.id) as question, count(a.id) as answers
from w
left join q on q.word = w.word
left join a on a.word = w.word
group by w.word

WORD	QUESTION	ANSWERS
create	0	2
before	0	2
control	0	2
temperature	0	2
the	0	2
cold	2	0
a	0	2
pizza	1	1
My	2	0
came	2	0
We	0	2
will	0	2
Burger	1	1
serving	0	2

fiddle showing some of the intermediate data.

CodePudding user response：

If you don't have/know/want regexp - you can use user defined function that I use for splitting csv or any other text. Function uses sql modeling and can split any string by the delimiter which can be a blank space too (' '). So, it can split words. Return type is also user defined.

create or replace TYPE STRING_ARRAY AS VARRAY(4000) OF VARCHAR2(1000);
--  ------------------------------------------------------------------
--
create or replace 
    FUNCTION split ( csvString  IN  VarChar2, delimiter  IN  VarChar2 := ',') RETURN STRING_ARRAY 
        AS 
    BEGIN
        Declare
            elements     STRING_ARRAY := STRING_ARRAY();
            CURSOR c IS
                SELECT  COL1
                FROM    (   SELECT INDX, MY_STR1, COL1_ELEMENTS, COL1
                            FROM
                                (   SELECT 0 "INDX", COL1 "MY_STR1", COL1_ELEMENTS, COL1
                                    FROM
                                        (
                                            SELECT
                                                REPLACE(COL1, delimiter || ' ', delimiter) "COL1",  
                                                Trim(Length(Replace(COL1, delimiter || ' ', delimiter))) - Trim(Length(Translate(REPLACE(COL1, delimiter || ' ', delimiter), 'A' || delimiter, 'A')))   1 "COL1_ELEMENTS"
                                            FROM 
                                                (SELECT csvString "COL1" FROM DUAL)     
                                        )
                                )
                            MODEL       
                                DIMENSION BY(0 as INDX)
                                MEASURES(COL1, COL1_ELEMENTS, CAST('a' as VarChar2(4000)) as MY_STR1)
                                RULES ITERATE (1000)        -- If you don't know the number of elements this should be bigger then you aproximation. Othewrwise it will split given number of elements
                                (
                                    COL1_ELEMENTS[ITERATION_NUMBER   1] = COL1_ELEMENTS[0],
                                    MY_STR1[0] = COL1[CV()],
                                    MY_STR1[ITERATION_NUMBER   1] = SubStr(MY_STR1[ITERATION_NUMBER], InStr(MY_STR1[ITERATION_NUMBER], delimiter, 1)   Length(delimiter)),
                                    COL1[ITERATION_NUMBER   1] = SubStr(MY_STR1[ITERATION_NUMBER], 1, CASE WHEN InStr(MY_STR1[ITERATION_NUMBER], delimiter) <> 0 THEN InStr(MY_STR1[ITERATION_NUMBER], delimiter)-1 ELSE Length(MY_STR1[ITERATION_NUMBER]) END)
                                )
                        )
                WHERE INDX > 0 And INDX <= COL1_ELEMENTS; 
            --
            cStr    VarChar2(1000);
            i       Number := 1;
        Begin 
            If c%ISOPEN Then 
                CLOSE c; 
            End If;
            OPEN c;
            LOOP
                FETCH c Into cStr;
                EXIT WHEN c%NOTFOUND;
                elements.extend;
                elements(i) := cStr;
                i := i   1;
            END LOOP;
            CLOSE c;
            RETURN elements;
        End;
    END split;
--
--  ------------------------------------------------------------------

with your sample data:

WITH
    a_tbl AS
        (
            Select  1 "ID", 'My pizza  came cold' "Q_STR", 'We will create a temperature control before serving the pizza' "A_STR"  From Dual Union All
            Select  2 "ID", 'My Burger came cold' "Q_STR", 'We will create a temperature control before serving the Burger' "A_STR" From Dual
        ),

... you can collect all the words from questions and answer - using the split() function create a cte (I named it all_words) like here:

    all_words AS
        (
            Select  q.ID "ID", Replace(Replace(Replace(Replace(w.COLUMN_VALUE, ',', ''), '.', ''), '!', ''), '?', '') "Q_WORD", Null "A_WORD"
            From a_tbl q
            Inner Join TABLE(SPLIT(Q_STR, ' ')) w ON (1 = 1)
          Union All
            Select  a.ID "ID", Null "Q_WORD", Replace(Replace(Replace(Replace(w.COLUMN_VALUE, ',', ''), '.', ''), '!', ''), '?', '') "A_WORD"
            From a_tbl a
            Inner Join TABLE(SPLIT(a.A_STR, ' ')) w ON (1 = 1)
        )

Main sql:

Select 
    COALESCE(Q_WORD, A_WORD) "WORD",
    Count(Q_WORD)  "QUESTION",
    Count(A_WORD)  "ANSWER"
From all_words
Group By COALESCE(Q_WORD, A_WORD)
Order By Count(Q_WORD) DESC

The result should be:

WORD	QUESTION	ANSWER
My	2	0
cold	2	0
came	2	0
Burger	1	1
pizza	1	1
serving	0	2
We	0	2
will	0	2
the	0	2
temperature	0	2
control	0	2
before	0	2
create	0	2
a	0	2

Regards...
P.S.
You should consider making all the words lowercase/upercase before counting because words 'My' and 'my' will be counted separately just for one is at the begining of a sentence and other is not.