Find in string where words from array are next to each other

Say I have a sentence or two in a string, and I have an array of words. I need to find anywhere in the string where two or more words from the array are next to each other.

Example:

Words: ['cat','dog','and','the']

String: There is a dog and cat over there. The cat likes the dog.

Result: ['dog and cat','the dog','the cat']

The only way I've been able to do this is manually specifying possible combinations, but only for 3 words max as it gets long fast.

CodePudding user response：

You can use two pointers to iterate over the array, tracking the beginning and end of each sequence of words that appear in the words array. The code below first transforms the string into an array of lowercase words with punctuation removed (you would need to extend the set of characters to remove).

const
  words = ['cat', 'dog', 'and', 'the'],
  string = 'There is a dog and cat over there. The cat likes the dog.',
  expected = ['dog and cat', 'the dog', 'the cat']

let
  stringArray = string.toLowerCase().replace(/[.,]/g, '').split(' '),
  start = 0, end = 0, result = [];

while (start < stringArray.length) {
  if (words.includes(stringArray[start])) {
    end = start + 1;

    while (words.includes(stringArray[end])) {
      end++;
    }

    if (end - start >= 2) {
      result.push(stringArray.slice(start, end).join(' '));
    }

    start = end;
  }
  
  start++;
}

console.log(result)
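
A minimal sketch of the same two-pointer scan wrapped in a reusable function. The name findWordRuns and the minLength parameter are illustrative and not part of the answer above; stripping punctuation with \p{P} is also an assumption that broadens the [.,] list.

```javascript
// Hypothetical helper: two-pointer scan over normalized tokens.
function findWordRuns(text, words, minLength = 2) {
  const wordSet = new Set(words.map(w => w.toLowerCase()));

  // Lowercase, drop punctuation (assumption: \p{P} covers the marks you care about),
  // then split on whitespace and discard empty tokens.
  const tokens = text
    .toLowerCase()
    .replace(/\p{P}/gu, '')
    .split(/\s+/)
    .filter(Boolean);

  const result = [];
  let start = 0;
  while (start < tokens.length) {
    if (!wordSet.has(tokens[start])) {
      start++;
      continue;
    }
    // Advance end past every consecutive token that is in the set.
    let end = start + 1;
    while (end < tokens.length && wordSet.has(tokens[end])) end++;
    if (end - start >= minLength) {
      result.push(tokens.slice(start, end).join(' '));
    }
    start = end;
  }
  return result;
}

console.log(findWordRuns(
  'There is a dog and cat over there. The cat likes the dog.',
  ['cat', 'dog', 'and', 'the']
));
```

Because punctuation is removed before the scan, a run can still span a sentence boundary, just as in the snippet above.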

CodePudding user response：

I'd do it with two reduces: one that groups successive words in the target set by accumulating them in arrays, and another that rejects empty arrays (where runs end) and joins each remaining run into a string.

const words = ['cat','dog','and','the'];
const wordSet = new Set(words); // optional for O(1) lookup
const string = 'There is a dog and cat over there. The cat likes the dog.';

const tokens = string.split(/[ .]+/).map(t => t.toLowerCase());  // split on spaces and periods, force lower case

const result = tokens
  .reduce((acc, word) => {
    if (wordSet.has(word)) acc[acc.length-1].push(word);
    else acc.push([]);
    return acc;
  }, [[]])
  .reduce((acc, run) => {
    if (run.length) acc.push(run.join(' '));
    return acc;
  }, []);

console.log(result);
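
Note that the second reduce keeps every non-empty run, including single words; the sample sentence just happens not to contain an isolated match. If only runs of two or more words are wanted, as the question asks, one possible variation is to filter on run length. This is a sketch, and the function name groupRuns and the minLength parameter are made up for illustration.

```javascript
// Hypothetical variation: group runs with reduce, then keep only runs
// that contain at least minLength words.
function groupRuns(text, words, minLength = 2) {
  const wordSet = new Set(words);
  const tokens = text.toLowerCase().split(/[ .]+/);

  return tokens
    .reduce((acc, word) => {
      if (wordSet.has(word)) acc[acc.length - 1].push(word); // extend the current run
      else acc.push([]);                                      // a non-match starts a new run
      return acc;
    }, [[]])
    .filter(run => run.length >= minLength)
    .map(run => run.join(' '));
}

// The leading "The" is a single-word run and is dropped.
console.log(groupRuns('The bird saw a cat and dog.', ['cat', 'dog', 'and', 'the']));
```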

CodePudding user response：

This problem could be approached by 'walking through' the sentence, beginning at each word and continuing each pass until the word in the sentence is no longer present in the array.

For example, the first iteration would start at the first word of the sentence and check whether it's in the array. If not in the array, begin again at the second word. If the word is present, check the next, ending if it's not in the array, or continuing if it is.

Two while loops allow for this. Non-alphabetic characters such as punctuation are removed for the presence test by calling replace with a regex, and capitals are changed to lower case for the comparison:

sentenceWordArray[position].toLowerCase().replace(/[^a-z]+/g, '')
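
The normalization step can be tried on a few tokens in isolation. This is a minimal sketch, assuming the regex is meant to strip runs of characters outside a-z; the helper name normalize is hypothetical.

```javascript
// Hypothetical helper: lowercase a token and strip everything that is not a-z.
const normalize = token => token.toLowerCase().replace(/[^a-z]+/g, '');

console.log(['There', 'dog.', 'The', "cat's"].map(normalize));
// Caveat: letters outside a-z are stripped too, so 'Café' becomes 'caf'.
console.log(normalize('Café'));
```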

A break statement is required in the inner while loop to prevent an out-of-bounds error should the position reach the length of the sentence word array.

Working snippet:

const words = ['cat','dog','and','the'];
const sentence = "There is a dog and cat over there. The cat likes the dog."


function matchWordRuns(sentence, dictionary) {
  const sentenceWordArray = sentence.split(" ");
  const results = [];
  let position = 0;
  const currentSearch = [];

  while (position < sentenceWordArray.length) {
    while (dictionary.indexOf(sentenceWordArray[position].toLowerCase().replace(/[^a-z]+/g, '')) > -1){
      currentSearch.push(sentenceWordArray[position].toLowerCase().replace(/[^a-z]+/g, ''));
      position++;
      if (position>=sentenceWordArray.length) {
        break;
      }
    
    } //  end while word matched;
    
    if (currentSearch.length>0) {
      results.push(currentSearch.join(" "));
    } // end if;
  
    position++;
    currentSearch.length=0; // empty array;

  } // end while, search over;

return results;

} // end function;

console.log(matchWordRuns(sentence, words));

/*
result:
[
  "dog and cat",
  "the cat",
  "the dog"
]
*/

CodePudding user response：

Same idea as pilchard's, with several refinements:

Using a regular expression with Unicode character class to know what "letters" are, and where sentences end — consequently, we don't need to list punctuation explicitly, and it should work on any language (e.g. "日本語！", which does not have ".", nor matches [a-z])
The result is made from substrings of the original string, so it preserves case and intervening punctuation (which may or may not be what OP wants; pass it again through .toLowerCase and .replace, if necessary)
Set for efficiency (assuming string and words are long enough to make it worth it)
Generator function for more flexibility and just because I don't see them often :P
Processes sentences separately, so it does not detect "cat. The dog"

const words = ['cat','dog','and','the'];
const string = "There is a dog and cat over there. The cat likes the dog.";

function* findConsecutive(words, string) {
  const wordSet = new Set(words.map(word => word.toLowerCase()));
  const sentences = string.split(/\s*\p{Sentence_Terminal}+\s*/u);
  for (const sentence of sentences) {
    let start = null, end, match; // declare match so it is not an implicit global
    const re = /\p{Letter}+/gu;
    while ((match = re.exec(sentence)) !== null) {
      if (wordSet.has(match[0].toLowerCase())) {
        start ??= match.index;
        end = match.index + match[0].length;
      } else if (start !== null) {
        yield sentence.substring(start, end);
        start = null;
      }
    }
    if (start !== null) {
      yield sentence.substring(start, end);
    }
  }
}

console.log([...findConsecutive(words, string)]);
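
To see what the two Unicode property escapes do on their own, here is a small sketch. The helper names letterRuns and splitSentences are hypothetical and only serve the demonstration.

```javascript
// Hypothetical helpers isolating the two regular expressions used above.

// Every maximal run of Unicode letters, in any script.
const letterRuns = text =>
  [...text.matchAll(/\p{Letter}+/gu)].map(m => m[0]);

// Split on runs of sentence-terminal punctuation (., !, ?, and their
// counterparts in other scripts), swallowing the surrounding whitespace.
const splitSentences = text =>
  text.split(/\s*\p{Sentence_Terminal}+\s*/u);

console.log(letterRuns('Кошка и собака, the cat!'));
console.log(splitSentences('One. Two! Three?'));
```

A trailing terminator leaves an empty string at the end of the split result, which is harmless here because it contains no letters. One caveat: in scripts written without spaces between words, a run of letters is not necessarily a single word, so word-level matching there would need a real segmenter (Intl.Segmenter with granularity 'word' is one option, where available).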

CodePudding user response：

This also handles the corner case where two words from the list sit on either side of a sentence boundary. Something like "A cat. The Dog" will not match, because technically they are not consecutive words: there is a dot between them. The program treats a dot as a word by replacing every dot in the text with a dot surrounded by spaces. It then collapses any extra spaces so that words and dots are separated by exactly one space, before splitting the text into words:

const words = ['cat', 'dog', 'and', 'the']
const text = 'There is a dog and cat over there. A cat. The cat likes the dog.'
const xs = text.toLowerCase().replace(/\./g," . ").replace(/ +(?= )/g,'').split(' ')

var result = []
var matched = []
var count = 0

xs.forEach(x => {
     if (words.includes(x)) {
         count += 1
         matched.push(x)
     } else {
         if (count > 1) 
            result.push(matched.join(' '))
         count = 0
         matched = []
     }
})

console.log(result)

Result: ['dog and cat', 'the cat', 'the dog']
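
One thing to watch: the forEach only pushes a run when it reaches a non-matching token, so it relies on the text ending with a period. A sketch of the same idea that also flushes after the loop might look like this; the function name runsWithinSentences and the flush helper are illustrative, not part of the answer above.

```javascript
// Hypothetical variation: periods become tokens that break a run,
// and any run still open at the end of the text is flushed as well.
function runsWithinSentences(text, words) {
  const wordSet = new Set(words);
  const tokens = text
    .toLowerCase()
    .replace(/\./g, ' . ') // make each period its own token
    .split(/\s+/)
    .filter(Boolean);      // drop empty tokens from leading/trailing spaces

  const result = [];
  let matched = [];
  const flush = () => {
    if (matched.length > 1) result.push(matched.join(' '));
    matched = [];
  };

  for (const token of tokens) {
    if (wordSet.has(token)) matched.push(token);
    else flush();
  }
  flush(); // handles text that ends in the middle of a run

  return result;
}

// "cat" and "The" are separated by a period, so they do not form a run.
console.log(runsWithinSentences('A cat. The dog and the cat', ['cat', 'dog', 'and', 'the']));
```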