I need your help with solving my problem. I'm new to regular expressions in Java and I don't know how to do it properly.
For example, I have this simple text:
In downtown Las Vegas, John spent a lot of time on The Strip, which is a 2.5 mile stretch of shopping, entertainment venues, luxury hotels, and fine dining experiences. This is probably the most commonly visited tourist area in the city. The Strip at night time looks especially beautiful. All of the buildings light up with bright, neon, eye-catching signs to attract visitors attention.
The task is to find and return all the words in the first sentence that do not appear in any of the other sentences. This should be done using regular expressions.
The word "in" is not suitable, as it is present in the second sentence:
"This is probably the most commonly visited tourist area ---in--- the city."
The word "downtown" is suitable, as none of the following sentences contains it.
The words "Las", "Vegas", "John", "spent", "a" and "lot" are suitable too ("of" is not, because the fourth sentence contains "All of the buildings").
The word "time" is not, as it is present in the third sentence:
"The Strip at night ---time--- looks especially beautiful."
And so on for all the words in the first sentence.
Some rules:
- Sentences are separated by dots (.)
- Articles are words too
- Searching is not case sensitive: "John" and "john" are the same word
CodePudding user response:
You could achieve what you're trying to do by:
- Splitting the text with a regex that matches a dot only when it is used as a period, not as a decimal separator.
- Creating a Set with all the words from the first sentence, extracted with a second regex.
- Iterating over each remaining sentence and parsing it with the second regex to check whether your Set contains words that also appear in the following sentences. If it does, you simply remove the “reappearing” word.
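To see the first step in isolation, here is a minimal sketch of the sentence split. The class name SentenceSplitDemo, the helper splitSentences and the sample string are made up for illustration, and the pattern (a dot not followed by a digit) is just one simple way to keep "2.5" intact; it would still split on abbreviations such as "e.g.".

```java
import java.util.Arrays;
import java.util.List;

public class SentenceSplitDemo {
    // Hypothetical helper: splits on a dot unless a digit follows it,
    // so a decimal number like "2.5" is not cut in half.
    static List<String> splitSentences(String text) {
        return Arrays.asList(text.split("\\.(?!\\d)"));
    }

    public static void main(String[] args) {
        String sample = "It is a 2.5 mile stretch. It is busy. It is bright.";
        for (String sentence : splitSentences(sample)) {
            System.out.println("[" + sentence.trim() + "]");
        }
        // prints:
        // [It is a 2.5 mile stretch]
        // [It is busy]
        // [It is bright]
    }
}
```

Note that String.split drops trailing empty strings, which is why the final dot does not produce an empty fourth sentence.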
Here is a simple implementation of the logic expressed above:
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String text = "In downtown Las Vegas, John spent a lot of time on The Strip, \n"
                + "which is a 2.5 mile stretch of shopping, entertainment venues, \n"
                + "luxury hotels, and fine dining experiences. This is probably the most \n"
                + "commonly visited tourist area in the city. The Strip at night time looks \n"
                + "especially beautiful. All of the buildings light up with bright, neon, \n"
                + "eye-catching signs to attract visitors attention.";

        //Splitting by the dot character except when a digit follows it (decimal separator)
        String[] sentences = text.split("\\.(?!\\d)");

        //Set containing all the words encountered within the first sentence
        Set<String> setWordsFirstSentence = new HashSet<>();

        //The regex identifies decimal numbers (like 2.5) and words
        Pattern patternWords = Pattern.compile("\\b(?:\\d+\\.\\d+|\\w+)\\b");
        Matcher matchSentence = patternWords.matcher(sentences[0]);

        //Adding each word of the first sentence to the set
        while (matchSentence.find()) {
            setWordsFirstSentence.add(matchSentence.group());
        }

        //Skipping the first sentence and removing from the set every word which appears in the following sentences
        for (int i = 1; i < sentences.length; i++) {
            matchSentence = patternWords.matcher(sentences[i]);
            while (matchSentence.find()) {
                //Copied into a local variable so it is read once per match
                String word = matchSentence.group();

                //You could remove the word with a lambda passed to removeIf...
                // setWordsFirstSentence.removeIf(w -> w.equalsIgnoreCase(word));

                //... or use an Iterator
                for (Iterator<String> it = setWordsFirstSentence.iterator(); it.hasNext();) {
                    if (it.next().equalsIgnoreCase(word)) {
                        it.remove();
                    }
                }
            }
        }

        //Printing the set's words
        setWordsFirstSentence.forEach(System.out::println);
    }
}
As the comments in the code note, the removal can also be done with a lambda expression passed to the removeIf method, which Set inherits from Collection and which removes every item for which the predicate is true. A lambda can only capture an effectively final variable, so the matched word must first be copied into a local String declared inside the loop. This is a question of style rather than efficiency: declaring a local variable in a loop has no measurable cost.
Otherwise, you could use an Iterator to walk your Set and remove a word whenever it equals, ignoring case, one of the words matched in the following sentences. Both approaches do the same work, so pick whichever reads better to you.
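A third option, not part of the answer's code, is to lower-case every word before storing it, so that plain Set operations handle the case-insensitive comparison and no explicit inner loop is needed. The sketch below is only an illustration under that assumption: the class name UniqueFirstSentenceWords, the helpers wordsOf and uniqueToFirstSentence, and the shortened sample text are all made up here, and the result is returned in lower case rather than in the original spelling.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UniqueFirstSentenceWords {
    // Decimal numbers such as 2.5, or runs of word characters
    private static final Pattern WORD = Pattern.compile("\\d+\\.\\d+|\\w+");

    // Hypothetical helper: lower-cased words of one sentence, in order of first appearance
    static Set<String> wordsOf(String sentence) {
        Set<String> words = new LinkedHashSet<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    // Hypothetical helper: words of the first sentence that occur in no later sentence
    static Set<String> uniqueToFirstSentence(String text) {
        // Split on a dot unless a digit follows it (keeps "2.5" together)
        String[] sentences = text.split("\\.(?!\\d)");
        Set<String> result = wordsOf(sentences[0]);
        for (int i = 1; i < sentences.length; i++) {
            result.removeAll(wordsOf(sentences[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "John spent time on The Strip, a 2.5 mile stretch. "
                + "The Strip at night time looks beautiful. John left.";
        System.out.println(uniqueToFirstSentence(text));
        // prints [spent, on, a, 2.5, mile, stretch]
    }
}
```

The LinkedHashSet is used only so the words come out in the order they appear in the first sentence; a HashSet would work too if order does not matter.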