Using toRegex() with split and replace increases performance

I am reading a file of 4 million records line by line, applying split and replace operations to each record for use further down the line. While doing so, I found that using toRegex() boosts performance.

private fun parseRecord(record: String) { // this is the method that will be called for each line
     val tokens = record.split(" ") //record will always contain 2 tokens
   
}

// calling parseRecord() for each line in the file
// (`file` is assumed to be a java.io.File pointing at the input)
file.forEachLine { line -> parseRecord(line) }

This takes approximately 30 seconds to complete the processing, whereas the code below completes in 20 seconds.

private fun parseRecord(record: String) { 
     val tokens = record.split(" ".toRegex()) 
}

The same holds for replace.

Why does using toRegex() improve performance? Also, is there a more efficient way to optimise this further?

Thanks in advance for the help!

CodePudding user response：

Although both calls look similar, their implementation is very different.

record.split(" ") is implemented by walking along your record and doing the necessary bookkeeping itself.

record.split(" ".toRegex()) is actually implemented as regex.split(record), so the regular-expression engine does the work. That can be more efficient than the naive approach, as your timings suggest.

You may even squeeze out a bit more performance by reusing your regular expression instead of creating a new one every time you split:


val SPACE_REGEX = " ".toRegex()

...

val tokens = record.split(SPACE_REGEX)
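
To check the difference on your own data, the three variants can be timed side by side. This is a rough sketch rather than a rigorous benchmark: the synthetic records, the record count and the `time` helper are assumptions for illustration, and JIT warm-up means the numbers will vary between runs and JVMs.

```kotlin
import kotlin.system.measureTimeMillis

// Compiled once and reused for every record.
val SPACE_REGEX = " ".toRegex()

fun splitByString(record: String): List<String> = record.split(" ")

// Compiles a new pattern on every call.
fun splitByNewRegex(record: String): List<String> = record.split(" ".toRegex())

fun splitByReusedRegex(record: String): List<String> = record.split(SPACE_REGEX)

// Hypothetical helper: runs `split` over all records and returns elapsed milliseconds.
fun time(records: List<String>, split: (String) -> List<String>): Long {
    var tokenCount = 0 // consumed below so the work cannot be skipped
    val elapsed = measureTimeMillis {
        for (record in records) tokenCount += split(record).size
    }
    check(tokenCount == records.size * 2) { "every record should yield 2 tokens" }
    return elapsed
}

fun main() {
    // Synthetic stand-in for the real file: two tokens separated by one space.
    val records = List(1_000_000) { i -> "foo$i bar$i" }

    println("split(\" \")           : ${time(records, ::splitByString)} ms")
    println("split(\" \".toRegex()) : ${time(records, ::splitByNewRegex)} ms")
    println("split(SPACE_REGEX)   : ${time(records, ::splitByReusedRegex)} ms")
}
```

Running each variant a few times before trusting the numbers helps, since the first pass includes JIT compilation.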

CodePudding user response：

If your file looks like the example below, you can avoid toRegex() and pass " " to split, because the columns are separated by a single literal space, which the normal split can handle. That could be a significant performance improvement, since Regex may not recognize this simple case.

format:

foo1 bar1
foo2 bar2
foo3 bar3
foo4 bar4
foo5 bar5
foo6 bar6
foo7 bar7
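
For records in exactly this shape, you could also skip the general-purpose splitter and cut the line at the first space. A minimal sketch, assuming every record holds exactly two tokens separated by a single space; `splitTwo` is a hypothetical helper name, not something from the standard library, and whether it is actually faster on your data is something to measure:

```kotlin
// Hypothetical helper: splits "foo bar" into ("foo", "bar") at the first space.
fun splitTwo(record: String): Pair<String, String> {
    val i = record.indexOf(' ')
    require(i >= 0) { "expected a space-separated record: $record" }
    return Pair(record.substring(0, i), record.substring(i + 1))
}

fun main() {
    val (first, second) = splitTwo("foo1 bar1")
    println("$first | $second") // prints "foo1 | bar1"
}
```

This avoids building a list per record, at the cost of only working for the two-token format.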

If you are writing for Kotlin/JVM, you may also want to call readLines and then parallelStream so that the work is spread across multiple cores:

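import java.io.File
import kotlin.streams.toList // only needed on JDKs older than 16, where Stream has no toList() member

val processed = File("/path/to/file") // replace this with actual path
    .readLines()       // reads the whole file into memory
    .parallelStream()  // processes the lines on multiple cores
    .map { line ->
        replace(split(line)) // assuming `replace` and `split` are given from outer scope
    }
    .toList()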
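
To try the parallel approach end to end, here is a self-contained sketch. `processLine` and `processFile` are hypothetical names, and the "foo" to "baz" replacement and the temporary file are stand-ins for your real per-line work and input; `Collectors.toList()` is used so the code does not depend on the JDK version.

```kotlin
import java.io.File
import java.util.stream.Collectors

// Hypothetical per-line work standing in for the real split and replace steps.
fun processLine(line: String): List<String> =
    line.split(" ").map { it.replace("foo", "baz") }

fun processFile(file: File): List<List<String>> =
    file.readLines()          // loads the whole file into memory
        .parallelStream()     // spreads the per-line work across cores
        .map { line -> processLine(line) }
        .collect(Collectors.toList()) // results stay in the original line order

fun main() {
    val file = File.createTempFile("records", ".txt")
    file.writeText("foo1 bar1\nfoo2 bar2\nfoo3 bar3\n")
    println(processFile(file)) // prints [[baz1, bar1], [baz2, bar2], [baz3, bar3]]
    file.delete()
}
```

Note that readLines holds all 4 million records in memory at once, so whether this helps depends on how heavy the per-line work is and how much heap you have.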