Home > Enterprise >  How to get the longest "overlapping" words
How to get the longest "overlapping" words

Time:11-03

Premise: Suppose you have a table containing words, where some may be distinct and some "may overlap", meaning a longer word starts with a shorter one, eg:

---------------
|    word     |
---------------
| dog         | *
| games       | *
| stat        |
| state       | 
| statement   | *
| fulfill     |
| fulfilled   | *
| fulfillment | *
---------------

Question: How does one write a query that returns the list of non-overlapping the longest overlapping words in such cases?

In the example above, the desired words are identified by a *, as per the following extended explanation:

  • dog and games don't overlap with anything, so they're the longest in their "solo/distinct" category
  • statement overlaps with state and stat and is the longest
  • fulfilled overlaps with fulfill and is longer (does NOT overlap with fulfillment)
  • fulfillment overlaps with fulfill and is longer (does NOT overlap with fulfilled)

Note: Please be aware that for the sake of simplicity, the data sample is reduced. In reality there are a few million records to query and, there are no search terms known before hand, so it's not possible to directly use constructs such as WHERE word LIKE 'stat%'. Not sure if relevant or not, but the words have a relatively short max length, eg 20.

CodePudding user response:

Something like

select word
from   your_table t1
where  not exists (
                    select word
                    from   your_table t2
                    where  t2.word like t1.word || '_%'
                  )
;

The query would benefit from the existence of an index on word. But even then it may take a long time. In any case, you can give it a try and let us know what happens.

CodePudding user response:

As long as you compare prefixes and if one word is entirely included as prefix of the other, you may try match_recognize to sequentially check prefix-matching in one pass.

But exists is more clear, though you should examine performace on your real dataset with index on word.

with a as (
  select
    column_value as word
  from sys.odcivarchar2list(
    'dog'
    , 'games'
    , 'stat'
    , 'state'
    , 'statement'
    , 'fulfill'
    , 'fulfilled'
    , 'fulfillment'
  )
)
select *
from a 
match_recognize(
  order by word desc /*Longest first*/
  measures
    a.word as word
  one row per match
  /*Exclude any number of words
    each of which is a prefix of the previous one
  */
  pattern (a {- b* -})
  define
    /*Current word is a prefix of the previous*/
    b as prev(word) like word || '%'
) t
| WORD        |
| :---------- |
| statement   |
| games       |
| fulfillment |
| fulfilled   |
| dog         |

db<>fiddle here

CodePudding user response:

How does one write a query that returns the list of non-overlapping the longest overlapping words in such cases?

You can do it in a single table scan with a hierarchical query:

SELECT word,
       MAX(PRIOR word) AS substring
FROM   words
WHERE  CONNECT_BY_ISLEAF = 1
AND    LEVEL IN (1,2)
CONNECT BY word LIKE PRIOR word || '%'
AND        PRIOR ROWID != ROWID
GROUP BY word;

Which, for the sample data:

CREATE TABLE words (word) AS
SELECT 'dog' FROM DUAL UNION ALL
SELECT 'games' FROM DUAL UNION ALL
SELECT 'stat' FROM DUAL UNION ALL
SELECT 'state' FROM DUAL UNION ALL
SELECT 'statement' FROM DUAL UNION ALL
SELECT 'fulfill' FROM DUAL UNION ALL
SELECT 'fulfilled' FROM DUAL UNION ALL
SELECT 'fulfillment' FROM DUAL;

Outputs:

WORD SUBSTRING
dog
games
statement state
fulfilled fulfill
fulfillment fulfill

db<>fiddle here

CodePudding user response:

You may simple sort the words and discard all words that are the prefix of the following word.

You must consider the last word (see the OR predicate in the where condition)

with words as (
select WORD,
lead(WORD) over (order by WORD) WORD_LEAD
from tabs )
select word 
from words
where word != substr( WORD_LEAD, 1, length(word)) or WORD_LEAD is NULL /* last word */
order by word;

result

WORD 
-----------
dog
fulfilled
fulfillment
games
statement

This solution covers the duplicates as well and does not contains the prohibitive FILTER operation as other proposals that would trottle the performance for large tables (except for the match_recognizesolution).

  • Related