I currently have a file that contains a few hundred thousand words and I need to find valid pairs within the file. It’s formatted as such:
Animals
Should
Blood
Nice
Floral
The
I’m
Laptop
Etc etc
I was just wondering what an efficient way of finding pairs would be if I haven’t already produced the most efficient solution (which I doubt). I currently iterate over the file and finding valid words that can be used in pairs and then iterating over the valid words twice to find the actual pairs.
CodePudding user response:
I would say to use the regex approach:
public List<String> searchStreamRegex() {
final Pattern pattern = Pattern.compile("your regex");
InputStream inputStream = new FileInputStream(pathFile);
BufferedReader bufferedReader = new BufferedReader(new
InputStreamReader(inputStream));
Stream<String> linesStream = bufferedReader.lines();
return linesStream.filter(a ->
pattern.matcher(a).find()).collect(Collectors.toList());
}
You can find a benchmark here that should help you to define the best approach in your scenario: https://medium.com/geekculture/search-string-in-file-with-java-as-fast-as-possible-fedafdc7a3ee
CodePudding user response:
Without any information about what constitutes a "pair" and how it might be found more efficiently than checking the return value of isPair(word1, word2)
for each permutation of word1
and word2
):
Assuming you're using a moderately modern computer, the quickest way might be to read the entire file from disc, parse it into a datastructure and then traverse that list of strings at your leisure. A list of < 1 mio. strings of length < 10 chars should be less that 10 MB, that sounds manageable.
The classical brute force approach would be something like (pseudo-code)
for (i = 0 .. list.length)
for (j = i .. list.length)
if (isPair(list[i], list[j]))
handlePair(list[i], list[j])
Should work, and reasonably quickly -- but depending on whether or not this snippet fits your situation at all, and on your implementation of isPair()
and handlePair()
and their complexity, this might be totally inappropriate.
Need more info...