Splitting a list of strings into list of tuples rapidly


I'm trying to squeeze as much performance out of my code as possible, and I'm losing a lot of it on the tuple conversion.

with open("input.txt", 'r') as f:
    lines = f.readlines()

lines = [tuple(line.strip().split()) for line in lines]

Unfortunately a certain component of my code requires the list to contain tuples instead of lists after splitting, but converting the lists to tuples this way is very slow. Is there any way to force .split() to return a tuple or to perform this conversion much faster, or would I have to change the rest of the code to avoid .split()?

CodePudding user response:

Creating tuples doesn't seem to be especially slow

I don't think tuple creation is the performance bottleneck here. (Profiled on a run reading a 6.2 MB text file.)

Code:

import cProfile


def to_tuple(l):
    return tuple(l)


with open('input.txt', 'r') as f:
    lines = f.readlines()
cProfile.run("lines = [to_tuple(line.strip().split()) for line in lines]")

Profile result:

time python3 tuple-perf.py
         385375 function calls in 0.167 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.072    0.072    0.165    0.165 <string>:1(<listcomp>)
        1    0.002    0.002    0.166    0.166 <string>:1(<module>)
   128457    0.017    0.000    0.017    0.000 tuple-perf.py:5(to_tuple)
        1    0.000    0.000    0.167    0.167 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   128457    0.062    0.000    0.062    0.000 {method 'split' of 'str' objects}
   128457    0.013    0.000    0.013    0.000 {method 'strip' of 'str' objects}

If your profiling results look different, you could edit the answer to add more details.

Possible solutions

  1. Use a generator
iter_lines = (tuple(line.strip().split()) for line in lines)

This is useful if you can process lines asynchronously. For example, if you need to send one API request per line or publish each line to a message queue so another process can handle it, a generator lets you pipeline the workload instead of waiting for all lines to be converted first (see the sketch after this list).

However, if you need all lines at once as input for the next step of your data processing, it won't help much.

  2. Use another, faster language to process that part of the data

If you need a list with the complete data and still need to squeeze out every bit of performance, your best bet is to implement that part in a faster language.

However, I would strongly recommend doing some detailed profiling first. All performance optimization starts with profiling; otherwise it's very easy to make the wrong call and spend effort on something that doesn't really improve performance.
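As a rough illustration of the pipelining idea in option 1, here is a minimal sketch; process_record is a hypothetical placeholder for whatever per-line work you have (an API call, publishing to a queue, ...):

def process_record(record):
    # Hypothetical placeholder: send an API request, publish to a queue, etc.
    ...


with open("input.txt", 'r') as f:
    # str.split() with no arguments already discards the trailing newline
    iter_lines = (tuple(line.split()) for line in f)
    for record in iter_lines:
        process_record(record)  # each tuple is handled as soon as it is built

Each line is read, converted, and consumed before the next one is touched, so the whole file never has to sit in memory as a list.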

CodePudding user response:

There is nothing wrong with your code. In general, all the operations you use are fast; better yet, they are atomic operations that cannot be optimized further.

One suggestion: avoid loading the whole file at once and then walking over it; iterate over the file object instead. This should work efficiently:

with open("input.txt", 'r') as f:
    lines = [tuple(line.strip().split()) for line in f]

Hint

If the file is huge and you cannot break down the problem itself, it can be very effective to break down the input data.

One option is to split the input file, spread the computation, and put the results back together. There is a utility that can split a file into smaller parts:

https://pypi.org/project/filesplit/

Once you have multiple inputs, multiprocessing.Pool lets you distribute the computation across processes. See if this example helps:

https://superfastpython.com/multiprocessing-pool-example/
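As a rough sketch of that idea, assuming the per-line work is just the split-and-tuple conversion, you could also hand chunks of lines to a multiprocessing.Pool directly; the chunk size and helper names (parse_chunk, read_chunks) are illustrative choices, not part of any library:

from itertools import islice
from multiprocessing import Pool

CHUNK_SIZE = 100_000  # lines per worker task; tune for your data


def parse_chunk(chunk):
    # Convert one block of raw lines into tuples.
    return [tuple(line.split()) for line in chunk]


def read_chunks(path, size=CHUNK_SIZE):
    # Yield successive blocks of lines so each worker gets a sizeable task.
    with open(path, 'r') as f:
        while True:
            chunk = list(islice(f, size))
            if not chunk:
                break
            yield chunk


if __name__ == '__main__':
    with Pool() as pool:
        # Flatten the per-chunk results back into a single list of tuples.
        lines = [t for chunk in pool.imap(parse_chunk, read_chunks("input.txt"))
                 for t in chunk]

Whether this pays off depends on how expensive the per-line work is compared with the cost of shipping chunks to the worker processes, so profile it against the single-process version.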
