How to find specific words in the first sentence of the text using regular expressions in Java?

I need your help with solving my problem. I'm new to regular expressions in Java and I don't know how to do it properly.

For example, I have this simple text:

In downtown Las Vegas, John spent a lot of time on The Strip, which is a 2.5 mile stretch of shopping, entertainment venues, luxury hotels, and fine dining experiences. This is probably the most commonly visited tourist area in the city. The Strip at night time looks especially beautiful. All of the buildings light up with bright, neon, eye-catching signs to attract visitors attention.

The task is to find and return all words in the first sentence that do not appear in any of the other sentences. This should be done using regular expressions.

The word in is not suitable, as it is present in the second sentence:

"This is probably the most commonly visited tourist area ---in--- the city."

The word downtown is suitable, as none of the following sentences contain it. The words Las, Vegas, John, spent, a, and lot are suitable too.

The word time is not, as it is present in the third sentence:

"The Strip at night ---time--- looks especially beautiful."

and so on for all words in the 1st sentence.

Some rules

Sentences are separated by dots (.)
Articles are words too
Searching is not case sensitive: John and john are the same words

CodePudding user response:

You can achieve what you're trying to do by:

Splitting the text with a regex that matches a dot only when it is used as a period, not as a decimal separator.
Creating a Set with all the words from the first sentence, extracted with a second regex.
Iterating over each remaining sentence and parsing it with the second regex to check whether your Set contains words that also appear in the following sentences. If it does, you simply remove the “reappearing” word.
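
To see what the two regular expressions in these steps do on their own, here is a minimal sketch. The class name RegexStepsDemo, the helper names splitSentences and extractWords, and the sample string are made up for illustration, and the patterns are one possible way to write the two regexes, assuming a sentence never ends with a dot that is immediately followed by a digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexStepsDemo {

    // A "word" is either a decimal number such as 2.5 or a run of word characters.
    private static final Pattern WORD = Pattern.compile("\\b(?:\\d+\\.\\d+|\\w+)\\b");

    // A dot is treated as a sentence separator only when no digit follows it,
    // so the dot inside "2.5" does not split the sentence.
    public static String[] splitSentences(String text) {
        return text.split("\\.(?!\\d)");
    }

    // Collects every match of the word pattern, in order of appearance.
    public static List<String> extractWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(sentence);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "The Strip is a 2.5 mile stretch. It looks beautiful at night.";
        String[] sentences = splitSentences(sample);
        System.out.println(sentences.length);            // prints 2
        System.out.println(extractWords(sentences[0]));  // prints [The, Strip, is, a, 2.5, mile, stretch]
    }
}
```

Note that String.split drops trailing empty strings, which is why the final dot does not produce a third, empty sentence.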

Here is a simple implementation of the logic expressed above:

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {
        String text = "In downtown Las Vegas, John spent a lot of time on The Strip, \n" +
                "which is a 2.5 mile stretch of shopping, entertainment venues, \n" +
                "luxury hotels, and fine dining experiences. This is probably the most \n" +
                "commonly visited tourist area in the city. The Strip at night time looks \n" +
                "especially beautiful. All of the buildings light up with bright, neon, \n" +
                "eye-catching signs to attract visitors attention.";

        //Splitting by the dot character except when it represents a decimal separator,
        //i.e. when the dot is immediately followed by a digit
        String[] sentences = text.split("\\.(?!\\d)");

        //Set containing all the words encountered within the first sentence
        Set<String> setWordsFirstSentence = new HashSet<>();

        //The current regex identifies words and decimal numbers (like 2.5)
        Pattern patternWords = Pattern.compile("\\b(?:\\d+\\.\\d+|\\w+)\\b");
        Matcher matchSentence = patternWords.matcher(sentences[0]);

        //Adding each word of the first sentence to the set
        while (matchSentence.find()) {
            setWordsFirstSentence.add(matchSentence.group());
        }

        //Skipping the first sentence and removing from the set every word which appears in the following sentences
        for (int i = 1; i < sentences.length; i++) {
            matchSentence = patternWords.matcher(sentences[i]);
            while (matchSentence.find()) {
                //You could either remove the word with a lambda passed to the removeIf method, which needs
                //a local copy of the match because a lambda can only capture effectively final variables...
//                String tempWord = matchSentence.group();
//                setWordsFirstSentence.removeIf(word -> word.equalsIgnoreCase(tempWord));

                //... Or use an Iterator
                for (Iterator<String> it = setWordsFirstSentence.iterator(); it.hasNext();) {
                    if (it.next().equalsIgnoreCase(matchSentence.group())) {
                        it.remove();
                    }
                }
            }
        }

        //Printing the set's words (a HashSet does not guarantee any particular order)
        setWordsFirstSentence.forEach(System.out::println);
    }
}

As noted in the code comments, the removal can also be done with a lambda expression passed to the removeIf method, which Set inherits from Collection and which removes every element for which the predicate is true. However, the matched word must first be copied into a local String variable at every iteration so that it is effectively final and can be captured by the lambda. The cost of that variable is negligible, so this is a matter of style rather than efficiency.

Otherwise, you could use an Iterator to walk your Set and remove a word whenever it equals, ignoring case, one of the words matched in the following sentences.
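
For comparison, here is a rough sketch of the removeIf variant wrapped in a reusable method. It is an illustration rather than the code from the answer: the class name UniqueFirstSentenceWords and the method name find are hypothetical, the shortened sample text is made up, and a LinkedHashSet is swapped in (an assumption, not something the answer uses) so that the words come back in the order they appear in the first sentence.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UniqueFirstSentenceWords {

    // Words and decimal numbers such as 2.5
    private static final Pattern WORD = Pattern.compile("\\b(?:\\d+\\.\\d+|\\w+)\\b");

    // Returns the words of the first sentence that appear in no later sentence,
    // comparing words without regard to case.
    public static Set<String> find(String text) {
        // Split on dots that are not immediately followed by a digit
        String[] sentences = text.split("\\.(?!\\d)");
        Set<String> words = new LinkedHashSet<>();
        if (sentences.length == 0) {
            return words; // e.g. the text consisted only of dots
        }

        Matcher matcher = WORD.matcher(sentences[0]);
        while (matcher.find()) {
            words.add(matcher.group());
        }

        for (int i = 1; i < sentences.length; i++) {
            matcher = WORD.matcher(sentences[i]);
            while (matcher.find()) {
                // Local copy so the lambda captures an effectively final variable
                String later = matcher.group();
                words.removeIf(word -> word.equalsIgnoreCase(later));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "In downtown Las Vegas, John spent a lot of time. "
                + "This is in the city. Time flies.";
        // prints [downtown, Las, Vegas, John, spent, a, lot, of]
        System.out.println(find(text));
    }
}
```

In this sample, In and time are dropped because in and Time occur in the later sentences, which shows the case-insensitive comparison at work.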