I have a list of words I need to search for (say a few thousand entries of varchar2 of no more than 30 characters), I need to search for the presence of these words in sentences (say about a hundred million entries of varchar2 of no more than 256 characters). I would like to get the id of the text with a least one word matching and ideally a list of indexes giving the positions of the searched words.
ID | searched words |
---|---|
1 | pluto |
2 | jupiter |
ID | sentences |
---|---|
1 | we go back to earth |
2 | we discover pluto and jupiter |
would give back the results
minimum results |
---|
2 |
ideal results |
---|
2, ( (1, 13), (2, 23)) |
While this is something that can be developed, it feels like it is a common SQL request. Hence I wonder if there are best practices to do it or even better if there is a dedicated function in oracle SQL starting from 19c or PL/SQL that would do such a thing in an efficient way.
It seems that Oracle Text CONTAINS
and ACCUMulate
would work but I am not sure I can use Oracle Text in my context and if this would be typically slower or faster than a pure SQL PL/SQL request.
CodePudding user response:
I make no claims to performance and I wouldn't run this in a PRODUCTION ENVIORNMENT until vetted and load/performance impacts considered.
- I use 2 CTE's to simulate your data (SearchWords and Sentences)
- I use
instr()
to find the position of each word in a sentence - I use
listAgg()
to combine the data into one row for each word found in a sentence. - I only return occurrences where a word is found in a sentence
- I use
CROSS JOIN
so each search word related to every sentence (this could get UGLY in terms of memory usage CPU etc as the data set will be huge) thousands of words times hundreds of millions of sentences...
This is likely better done using text searches but I'm not sure how I'd get the data format you are looking for that way... shrug if it's a one time thing and you have the time to wait.... and it's in an environment where you won't bring down production....
DEMO: https://dbfiddle.uk/?rdbms=oracle_21&fiddle=77e0b8d9373ee1abc14cf10342c45767
with SearchWords as (SELECT 1 ID , 'pluto' SearchWord from dual UNION ALL
SELECT 2, 'jupiter' from dual),
Sentences as (SELECT 1 ID, 'we go bacvk to earth' sentence from dual UNION ALL
SELECT 2, 'we discover pluto and jupiter' from dual),
Step1 as (SELECT S.ID, LISTAGG('(' || W.ID || ',' || instr(S.Sentence,W.SearchWord) || ')', ',')
WITHIN GROUP (ORDER BY W.ID) Result
FROM Sentences S
CROSS JOIN SearchWords W
WHERE instr(S.Sentence,W.SearchWord)>0
GROUP BY S.ID)
SELECT * FROM Step1
Really don't need the step1 CTE... but I wasn't sure if It was going to work out of the gate.
Giving us:
---- ---------------
| ID | RESULT |
---- ---------------
| 2 | (1,13),(2,23) |
---- ---------------
If needed: You could subdivide the sentences into processing groups to processes some then union in more etc... to manage the hit. But if your environment is sufficiently large it may be able to handle it in one go.