I have a file with 500,000 lines and I want to check for each line L whether any other line in the same file ends with L.
I have already sorted the file by the length of the lines and have written the following code but it is to slow:
def main(args: Array[String]): Unit = {
val buffer = new BufferedReader(new FileReader("input.txt"))
val fw = new FileWriter("output.txt")
var line = buffer.readLine()
var list = List.empty[String]
while (line != null) {
if (line.nonEmpty) {
list = line
}
line = buffer.readLine()
}
buffer.close()
list = list.sortBy(s => s.length)
for (i <- list.indices) {
val endsWith = list(i)
for (j <- i 1 until list.size) {
val right = list(j)
if (right.endsWith(endsWith)) {
fw.write(list(j) ";" list(i) "\n")
fw.flush()
}
}
println(i 1)
}
fw.close()
}
The input file contains such entries like:
abc/defg
defg
...
Is there a more efficient way to check the lines?
CodePudding user response:
You need to sort file in specific way.
Try next algorithm:
- Revers each line.
- Sort list.
- Go thru list and for each adjacent pair and check if shorter is beginning of longer.
CodePudding user response:
I have found a solution to the problem by using the length of each line L and extract this length from the other lines. Afterward, I hash L and the extracted strings and store them in a hash map. Finally, I check whether the map contains the query hash. As soon as two lines have the same length I need only to check the hash in the map, this saves me a lot of overhead. I also used the idea from talex to reverse each string before hashing.
Here the code for the solution:
def main(args: Array[String]): Unit = {
val buffer = new BufferedReader(new FileReader("input.txt"))
val fw = new FileWriter("output.txt")
var line = buffer.readLine()
var map = Map.empty[Int, List[String]]
while (line != null) {
if (line.nonEmpty) {
val len = line.length
map = len -> (map.getOrElse(len, List.empty[String]) List(line.reverse))
}
line = buffer.readLine()
}
buffer.close()
val list = map.keySet.toList.sorted
for (i <- list.indices) {
val len = list(i)
var cutMap = Map.empty[Int, List[String]]
for (j <- i 1 until list.size) {
for (right <- map(list(j))) {
val took = right.take(len)
cutMap = took.hashCode -> (cutMap.getOrElse(took.hashCode, List.empty[String]) List(right))
}
}
for (startsWith <- map(len)) {
val hashCode = startsWith.hashCode
if (cutMap.contains(hashCode)) {
for (right <- cutMap(hashCode)) {
fw.write(right.reverse ";" startsWith.reverse "\n")
fw.flush()
}
}
}
println(i 1)
}
fw.close()
}
If the hashCode function is not precise enough it can be replace by a more precise one. For bigger files the algorithm could be separated to work with file splitting.