while(tokenizer.hasMoreTokens()){
    currentWord = tokenizer.nextToken();
    String[] parts = currentWord.split(Pattern.quote("."));
    String[] parts2 = parts[0].split(Pattern.quote(","));
    String[] parts3 = parts2[0].split(Pattern.quote("?"));
    String[] parts4 = parts3[0].split(Pattern.quote("\\.| "));
    String[] parts5 = parts4[0].split("\"");
    String[] parts6 = parts5[0].split(Pattern.quote(":"));
    System.out.println(Arrays.toString(parts6));
}
I'm trying to get this text to split properly. The only issue right now is the word:
"Have
Also, if someone could provide a solution that combines all of this into one line, that would be nice; I couldn't get that to work. Thanks.
CodePudding user response:
Try this.
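In the Java string literal, \" escapes the double quote, and \\. and \\? escape the regex special characters "." and "?" (inside a character class that escaping is optional, but harmless). Every occurrence of . , " : ? is replaced with an empty string.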
while(tokenizer.hasMoreTokens()){
    currentWord = tokenizer.nextToken();
    final String cleanWord = currentWord.replaceAll("[\\.,\":\\?]", "");
    System.out.println(cleanWord);
}
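For context, the original chain most likely fails on "Have because String.split keeps a leading empty string when the input starts with the delimiter, so parts5[0] ends up empty. Below is a minimal sketch of that behaviour next to the replaceAll approach. The class name QuoteSplitDemo and the helper clean are made up for illustration:

```java
import java.util.Arrays;

public class QuoteSplitDemo {
    // Hypothetical helper: strips . , " : ? from a token in one call.
    // Inside a character class, '.' and '?' need no regex escaping.
    static String clean(String token) {
        return token.replaceAll("[.,\":?]", "");
    }

    public static void main(String[] args) {
        String token = "\"Have";
        // The delimiter sits at index 0, so split() yields a leading empty string.
        String[] parts = token.split("\"");
        System.out.println(Arrays.toString(parts)); // [, Have]
        // Removing the characters instead of splitting keeps the word intact.
        System.out.println(clean(token));           // Have
    }
}
```

Taking index 0 after each split only works when the punctuation follows the word, which is why removing the characters tends to be the safer one-liner here.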
CodePudding user response:
The Java API has classes designed for extracting words from text. One of them is java.text.BreakIterator:
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WordCollector {
    public static void main(String[] args) {
        // try-with-resources closes the file once the lines have been read
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            List<String> words = WordCollector.getWords(lines);
            System.out.println(words);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static List<String> getWords(Stream<String> lines) {
        List<String> result = new ArrayList<>();
        BreakIterator boundary = BreakIterator.getWordInstance();
        lines.forEach(line -> {
            boundary.setText(line);
            int start = boundary.first();
            // Walk over consecutive word boundaries in the line
            for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
                // Drop punctuation and surrounding whitespace; keep only non-empty words
                String candidate = line.substring(start, end).replaceAll("\\p{Punct}", "").trim();
                if (candidate.length() > 0) {
                    result.add(candidate);
                }
            }
        });
        return result;
    }
}
CodePudding user response:
Here is one way to do it if you want to split the line on non-letters.
[^A-Za-z]+
splits on one or more non-letters
String line = "wordA, wordB; wordC;;; wordD, wordE!? - !wordF??, !wordG!, wordH, wordI";
String[] words = line.split("[^A-Za-z]+");
for (String word : words) {
    System.out.println(word);
}
prints
wordA
wordB
wordC
wordD
wordE
wordF
wordG
wordH
wordI
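One caveat, assuming the input can start with punctuation (as "Have does in the question): split then returns an empty first element. A small sketch that filters it out; the class name SplitWordsDemo and the helper words are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitWordsDemo {
    // Hypothetical helper: splits on runs of non-letters and drops empty tokens.
    static List<String> words(String line) {
        return Arrays.stream(line.split("[^A-Za-z]+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The leading quote would otherwise produce "" as the first token.
        System.out.println(words("\"Have you seen it?\" she asked."));
        // [Have, you, seen, it, she, asked]
    }
}
```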
On the other hand, if you want to remove those characters from a word, use a similar pattern. No need to specify the non-letter characters separately.
String word = "C:om!>{}.p*u**te,;rs";
word = word.replaceAll("[^A-Za-z]","");
System.out.println(word);
prints
Computers
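Note that [^A-Za-z] keeps ASCII letters only, so accented letters are removed as well. If the text may contain them, the Unicode category \p{L} is one possible substitute. A sketch under that assumption, with made-up names LetterFilterDemo, asciiOnly and anyLetter:

```java
public class LetterFilterDemo {
    // Same pattern as above: keeps ASCII letters only.
    static String asciiOnly(String s) {
        return s.replaceAll("[^A-Za-z]", "");
    }

    // Assumed variant: \p{L} matches any Unicode letter, so accents survive.
    static String anyLetter(String s) {
        return s.replaceAll("[^\\p{L}]", "");
    }

    public static void main(String[] args) {
        String word = "caf\u00e9!"; // "cafe" with an accented final letter, plus "!"
        System.out.println(asciiOnly(word)); // caf
        System.out.println(anyLetter(word)); // the accented letter is kept, "!" is removed
    }
}
```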
CodePudding user response:
The code below shows how you can ignore all non-alphabetic characters.
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        int c;
        // Read standard input one byte at a time and echo only ASCII letters
        while ((c = System.in.read()) != -1) {
            if (('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')) {
                System.out.print((char) c);
            }
        }
    }
}