Is there a more efficient way to check whether one line ends with another line in a large file-CodePudding

I have a file with 500,000 lines and I want to check for each line L whether any other line in the same file ends with L.

I have already sorted the file by the length of the lines and have written the following code but it is to slow:

def main(args: Array[String]): Unit = {
    val buffer = new BufferedReader(new FileReader("input.txt"))
    val fw = new FileWriter("output.txt")
    var line = buffer.readLine()
    var list = List.empty[String]
    while (line != null) {
      if (line.nonEmpty) {
        list  = line
      }
      line = buffer.readLine()
    }
    buffer.close()
    list = list.sortBy(s => s.length)
    for (i <- list.indices) {
      val endsWith = list(i)
      for (j <- i   1 until list.size) {
        val right = list(j)
        if (right.endsWith(endsWith)) {
          fw.write(list(j)   ";"   list(i)   "\n")
          fw.flush()
        }
      }
      println(i   1)

    }
    fw.close()
  }

The input file contains such entries like:
abc/defg
defg
...

Is there a more efficient way to check the lines?

CodePudding user response：

You need to sort file in specific way.

Try next algorithm:

Revers each line.
Sort list.
Go thru list and for each adjacent pair and check if shorter is beginning of longer.

CodePudding user response：

I have found a solution to the problem by using the length of each line L and extract this length from the other lines. Afterward, I hash L and the extracted strings and store them in a hash map. Finally, I check whether the map contains the query hash. As soon as two lines have the same length I need only to check the hash in the map, this saves me a lot of overhead. I also used the idea from talex to reverse each string before hashing.

Here the code for the solution:

def main(args: Array[String]): Unit = {
    val buffer = new BufferedReader(new FileReader("input.txt"))
    val fw = new FileWriter("output.txt")
    var line = buffer.readLine()
    var map = Map.empty[Int, List[String]]
    while (line != null) {
      if (line.nonEmpty) {
        val len = line.length
        map  = len -> (map.getOrElse(len, List.empty[String])    List(line.reverse))
      }
      line = buffer.readLine()
    }
    buffer.close()
    val list = map.keySet.toList.sorted
    for (i <- list.indices) {
      val len = list(i)
      var cutMap = Map.empty[Int, List[String]]
      for (j <- i   1 until list.size) {
        for (right <- map(list(j))) {
          val took = right.take(len)
          cutMap  = took.hashCode -> (cutMap.getOrElse(took.hashCode, List.empty[String])    List(right))
        }
      }
      for (startsWith <- map(len)) {
        val hashCode = startsWith.hashCode
        if (cutMap.contains(hashCode)) {
          for (right <- cutMap(hashCode)) {
            fw.write(right.reverse   ";"   startsWith.reverse   "\n")
            fw.flush()
          }
        }
      }
      println(i   1)
    }
    fw.close()
  }

If the hashCode function is not precise enough it can be replace by a more precise one. For bigger files the algorithm could be separated to work with file splitting.