Premise: Suppose you have a table containing words, where some may be distinct and some "may overlap", meaning a longer word starts with a shorter one, eg:
---------------
| word |
---------------
| dog | *
| games | *
| stat |
| state |
| statement | *
| fulfill |
| fulfilled | *
| fulfillment | *
---------------
Question: How does one write a query that returns the list of non-overlapping the longest overlapping words in such cases?
In the example above, the desired words are identified by a *
, as per the following extended explanation:
dog
andgames
don't overlap with anything, so they're the longest in their "solo/distinct" categorystatement
overlaps withstate
andstat
and is the longestfulfilled
overlaps withfulfill
and is longer (does NOT overlap withfulfillment
)fulfillment
overlaps withfulfill
and is longer (does NOT overlap withfulfilled
)
Note: Please be aware that for the sake of simplicity, the data sample is reduced. In reality there are a few million records to query and, there are no search terms known before hand, so it's not possible to directly use constructs such as WHERE word LIKE 'stat%'
. Not sure if relevant or not, but the words have a relatively short max length, eg 20.
CodePudding user response:
Something like
select word
from your_table t1
where not exists (
select word
from your_table t2
where t2.word like t1.word || '_%'
)
;
The query would benefit from the existence of an index on word
. But even then it may take a long time. In any case, you can give it a try and let us know what happens.
CodePudding user response:
As long as you compare prefixes and if one word is entirely included as prefix of the other, you may try match_recognize
to sequentially check prefix-matching in one pass.
But exists
is more clear, though you should examine performace on your real dataset with index on word
.
with a as ( select column_value as word from sys.odcivarchar2list( 'dog' , 'games' , 'stat' , 'state' , 'statement' , 'fulfill' , 'fulfilled' , 'fulfillment' ) ) select * from a match_recognize( order by word desc /*Longest first*/ measures a.word as word one row per match /*Exclude any number of words each of which is a prefix of the previous one */ pattern (a {- b* -}) define /*Current word is a prefix of the previous*/ b as prev(word) like word || '%' ) t
| WORD | | :---------- | | statement | | games | | fulfillment | | fulfilled | | dog |
db<>fiddle here
CodePudding user response:
How does one write a query that returns the list of non-overlapping the longest overlapping words in such cases?
You can do it in a single table scan with a hierarchical query:
SELECT word,
MAX(PRIOR word) AS substring
FROM words
WHERE CONNECT_BY_ISLEAF = 1
AND LEVEL IN (1,2)
CONNECT BY word LIKE PRIOR word || '%'
AND PRIOR ROWID != ROWID
GROUP BY word;
Which, for the sample data:
CREATE TABLE words (word) AS
SELECT 'dog' FROM DUAL UNION ALL
SELECT 'games' FROM DUAL UNION ALL
SELECT 'stat' FROM DUAL UNION ALL
SELECT 'state' FROM DUAL UNION ALL
SELECT 'statement' FROM DUAL UNION ALL
SELECT 'fulfill' FROM DUAL UNION ALL
SELECT 'fulfilled' FROM DUAL UNION ALL
SELECT 'fulfillment' FROM DUAL;
Outputs:
WORD SUBSTRING dog games statement state fulfilled fulfill fulfillment fulfill
db<>fiddle here
CodePudding user response:
You may simple sort the words and discard all words that are the prefix of the following word.
You must consider the last word (see the OR
predicate in the where
condition)
with words as (
select WORD,
lead(WORD) over (order by WORD) WORD_LEAD
from tabs )
select word
from words
where word != substr( WORD_LEAD, 1, length(word)) or WORD_LEAD is NULL /* last word */
order by word;
result
WORD
-----------
dog
fulfilled
fulfillment
games
statement
This solution covers the duplicates as well and does not contains the prohibitive FILTER
operation as other proposals that would trottle the performance for large tables (except for the match_recognize
solution).