I'm trying to find some article that helps me in the work I have to do. I have a table with the questions and answers of a customer service on a website and I would like to create a correlation between cause and effect between them, regardless of the line in which the complaint was made and the question was answered, the idea would be to have a histogram that you give me a direction which are the most used words by the customers in the complaints that are in the STR_COMPLAINTS column and the most used words in the treatment of the STR_ANSWERS in the answers column. As follows example:
ID | STR_COMPLAINTS | STR_ANSWERS |
---|---|---|
1 | My pizza came cold | We will create a temperature control before serving the pizza |
:--: | :------------------- | :-------------- |
2 | My Burger came cold | We will create a temperature control before serving the Burger |
Expected outcome:
WORD | QUESTION | ANSWERS |
---|---|---|
My | 2 | 0 |
pizza | 2 | 2 |
came | 2 | 0 |
cold | 2 | 0 |
We | 0 | 2 |
will | 0 | 2 |
create | 0 | 2 |
a | 0 | 2 |
temperature | 0 | 2 |
control | 0 | 2 |
before | 0 | 2 |
serving | 0 | 2 |
the | 0 | 2 |
Burger | 2 | 2 |
CodePudding user response:
You can token each complain and each answer to get all the individual words, from that get all the unique words, and then count how many times each word appears.
This approach uses regular expressions to tokenize the strings, which can be slow and expensive, but you can modify that part if necessary.
Each column is handled in its own recursive CTE, and then a third CTE gets the unique words. That third CTE, with the list of words, is then outer joined to the first two, and the matching IDs are counted. (You might want to count distinct IDs if you don't want to count repeated words multiple times.)
with q (id, str_complaints, pos, word) as (
select id, str_complaints, 1, regexp_substr(str_complaints, '(\w )', 1, 1, null, 1)
from complaints
union all
select id, str_complaints, pos 1, regexp_substr(str_complaints, '(\w )', 1, pos 1, null, 1)
from q
where regexp_substr(str_complaints, '(\w )', 1, pos 1, null, 1) is not null
),
a (id, str_answers, pos, word) as (
select id, str_answers, 1, regexp_substr(str_answers, '(\w )', 1, 1, null, 1)
from complaints
union all
select id, str_answers, pos 1, regexp_substr(str_answers, '(\w )', 1, pos 1, null, 1)
from a
where regexp_substr(str_answers, '(\w )', 1, pos 1, null, 1) is not null
),
w (word) as (
select word
from q
union
select word
from a
)
select w.word, count(q.id) as question, count(a.id) as answers
from w
left join q on q.word = w.word
left join a on a.word = w.word
group by w.word
WORD | QUESTION | ANSWERS |
---|---|---|
create | 0 | 2 |
before | 0 | 2 |
control | 0 | 2 |
temperature | 0 | 2 |
the | 0 | 2 |
cold | 2 | 0 |
a | 0 | 2 |
pizza | 1 | 1 |
My | 2 | 0 |
came | 2 | 0 |
We | 0 | 2 |
will | 0 | 2 |
Burger | 1 | 1 |
serving | 0 | 2 |
fiddle showing some of the intermediate data.
CodePudding user response:
If you don't have/know/want regexp - you can use user defined function that I use for splitting csv or any other text. Function uses sql modeling and can split any string by the delimiter which can be a blank space too (' '). So, it can split words. Return type is also user defined.
create or replace TYPE STRING_ARRAY AS VARRAY(4000) OF VARCHAR2(1000);
-- ------------------------------------------------------------------
--
create or replace
FUNCTION split ( csvString IN VarChar2, delimiter IN VarChar2 := ',') RETURN STRING_ARRAY
AS
BEGIN
Declare
elements STRING_ARRAY := STRING_ARRAY();
CURSOR c IS
SELECT COL1
FROM ( SELECT INDX, MY_STR1, COL1_ELEMENTS, COL1
FROM
( SELECT 0 "INDX", COL1 "MY_STR1", COL1_ELEMENTS, COL1
FROM
(
SELECT
REPLACE(COL1, delimiter || ' ', delimiter) "COL1",
Trim(Length(Replace(COL1, delimiter || ' ', delimiter))) - Trim(Length(Translate(REPLACE(COL1, delimiter || ' ', delimiter), 'A' || delimiter, 'A'))) 1 "COL1_ELEMENTS"
FROM
(SELECT csvString "COL1" FROM DUAL)
)
)
MODEL
DIMENSION BY(0 as INDX)
MEASURES(COL1, COL1_ELEMENTS, CAST('a' as VarChar2(4000)) as MY_STR1)
RULES ITERATE (1000) -- If you don't know the number of elements this should be bigger then you aproximation. Othewrwise it will split given number of elements
(
COL1_ELEMENTS[ITERATION_NUMBER 1] = COL1_ELEMENTS[0],
MY_STR1[0] = COL1[CV()],
MY_STR1[ITERATION_NUMBER 1] = SubStr(MY_STR1[ITERATION_NUMBER], InStr(MY_STR1[ITERATION_NUMBER], delimiter, 1) Length(delimiter)),
COL1[ITERATION_NUMBER 1] = SubStr(MY_STR1[ITERATION_NUMBER], 1, CASE WHEN InStr(MY_STR1[ITERATION_NUMBER], delimiter) <> 0 THEN InStr(MY_STR1[ITERATION_NUMBER], delimiter)-1 ELSE Length(MY_STR1[ITERATION_NUMBER]) END)
)
)
WHERE INDX > 0 And INDX <= COL1_ELEMENTS;
--
cStr VarChar2(1000);
i Number := 1;
Begin
If c%ISOPEN Then
CLOSE c;
End If;
OPEN c;
LOOP
FETCH c Into cStr;
EXIT WHEN c%NOTFOUND;
elements.extend;
elements(i) := cStr;
i := i 1;
END LOOP;
CLOSE c;
RETURN elements;
End;
END split;
--
-- ------------------------------------------------------------------
with your sample data:
WITH
a_tbl AS
(
Select 1 "ID", 'My pizza came cold' "Q_STR", 'We will create a temperature control before serving the pizza' "A_STR" From Dual Union All
Select 2 "ID", 'My Burger came cold' "Q_STR", 'We will create a temperature control before serving the Burger' "A_STR" From Dual
),
... you can collect all the words from questions and answer - using the split() function create a cte (I named it all_words) like here:
all_words AS
(
Select q.ID "ID", Replace(Replace(Replace(Replace(w.COLUMN_VALUE, ',', ''), '.', ''), '!', ''), '?', '') "Q_WORD", Null "A_WORD"
From a_tbl q
Inner Join TABLE(SPLIT(Q_STR, ' ')) w ON (1 = 1)
Union All
Select a.ID "ID", Null "Q_WORD", Replace(Replace(Replace(Replace(w.COLUMN_VALUE, ',', ''), '.', ''), '!', ''), '?', '') "A_WORD"
From a_tbl a
Inner Join TABLE(SPLIT(a.A_STR, ' ')) w ON (1 = 1)
)
Main sql:
Select
COALESCE(Q_WORD, A_WORD) "WORD",
Count(Q_WORD) "QUESTION",
Count(A_WORD) "ANSWER"
From all_words
Group By COALESCE(Q_WORD, A_WORD)
Order By Count(Q_WORD) DESC
The result should be:
WORD | QUESTION | ANSWER |
---|---|---|
My | 2 | 0 |
cold | 2 | 0 |
came | 2 | 0 |
Burger | 1 | 1 |
pizza | 1 | 1 |
serving | 0 | 2 |
We | 0 | 2 |
will | 0 | 2 |
the | 0 | 2 |
temperature | 0 | 2 |
control | 0 | 2 |
before | 0 | 2 |
create | 0 | 2 |
a | 0 | 2 |
Regards...
P.S.
You should consider making all the words lowercase/upercase before counting because words 'My' and 'my' will be counted separately just for one is at the begining of a sentence and other is not.