What is the fastest way to read files line by line?


I have written code in Python that reads a file line by line and performs some averaging and summation operations.

I need suggestions for speeding it up.

The number of lines in the pressurefile is 945,670 for now (it will go higher).

ORIGINAL CODE: This is the original version that I posted. Based on the suggestions, I have been optimizing the code and have posted the most recent version at the end.

    def time_average():
        try:
            filename = mem.pressurefile
            navg = mem.NFRAMES
            dz = mem.dz
            zlo = mem.zlo
            NZ = mem.NZ
            mass = mem.mass

            dens_fact = amu_to_kg / (mem.slab_V * ang3_to_m3)

            array_pxx = np.zeros([NZ,1])
            array_pyy = np.zeros([NZ,1])
            array_pzz = np.zeros([NZ,1])
            array_ndens = np.zeros([NZ,1])

            array_density = np.zeros([NZ,1])
            array_enthalpy = np.zeros([NZ,1])
            array_surf_tens = np.zeros([NZ,1])

            counter = 0
            with open(filename) as f:
                for line in f:
                    line.strip("\n")
                    #content = [_ for _ in line.split()]
                    content = line.split()
                    if len(content) == 7:
                        z = float(content[3]) - zlo
                        pxx = float(content[4])
                        pyy = float(content[5])
                        pzz = float(content[6])

                        loc = math.floor(z/dz)
                        if loc >= NZ:
                            loc = loc - NZ
                        elif loc < 0:
                            loc = loc + NZ
                        #print(z, loc, zlo)

                        array_pxx[loc] += pxx
                        array_pyy[loc] += pyy
                        array_pzz[loc] += pzz
                        array_ndens[loc] += 1
                    counter += 1
            for col in range(NZ):
                array_pxx[col] /= navg
                array_pyy[col] /= navg
                array_pzz[col] /= navg
                array_ndens[col] /= navg
                array_density[col] = mass * dens_fact * array_ndens[col]

            return (array_density, array_enthalpy, array_surf_tens)
        except IndexError as err:
            writelog(err)
            writelog(float(content[3]), loc, zlo)

So far, I have tried the below options:

Profiling:

Profiled the main code using cProfile and identified that the above helper function consumes ~10 s for a 74.4 MB file. To me, this 10 s is high.
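
For reference, the profile was collected along these lines (the exact invocation may differ slightly; ptythinfile.py is the main script from Option 1 below):

    python3 -m cProfile -s cumtime ptythinfile.py > profile.txt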

Option 1: cython3

Compiled using Cython as follows:

    cython3 --embed -o ptythinfile.c ptythinfile.py

    gcc -Os -I /usr/include/python3.8 -o ptythinfile ptythinfile.c -lpython3.8 -lpthread -lm -lutil -ldl

This did not yield any performance improvements.

Option 2: C/C++

Converting the entire code to C/C++ and compiling it.

In fact, my first code was in C++, but debugging was a nightmare, so I switched to Python. So, I don't want to follow this route.

Option 3: Pypy3

I tried pypy3 and ran into compatibility issues. I have Python 3.8 and 3.9, but pypy3 was targeting Python 3.6, so I gave up.

Option 4: External C library

I read a tutorial on compiling the helper function as C code and calling it from Python. This would be my next attempt.
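
A minimal placeholder sketch of what I have in mind, here using ctypes (the library and function names are made up for illustration, not working code yet):

    # Placeholder sketch: call a compiled C helper through ctypes.
    # "libpressure.so" and "accumulate_file" are hypothetical names.
    import ctypes
    import numpy as np

    lib = ctypes.CDLL("./libpressure.so")
    lib.accumulate_file.argtypes = [
        ctypes.c_char_p,                                 # path to the pressurefile
        np.ctypeslib.ndpointer(dtype=np.float64),        # output bins (e.g. pxx)
        ctypes.c_int, ctypes.c_double, ctypes.c_double,  # NZ, dz, zlo
    ]

    array_pxx = np.zeros(mem.NZ)
    lib.accumulate_file(mem.pressurefile.encode(), array_pxx, mem.NZ, mem.dz, mem.zlo)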

Searching Google, I found many options like Shed Skin, etc. Could you point out the best way to optimize the above code snippet, and possible alternative solutions to speed it up?

UPDATE 1: OCT 21, 2021. The code has been updated based on the comments from the experts below. Tested and working well. However, the average execution time only dropped from ~10 s to ~9.4 s.

The content of the pressurefile is output from the LAMMPS software, and the first few lines of it look like this:

    ITEM: TIMESTEP
    50100
    ITEM: NUMBER OF ATOMS
    2744
    ITEM: BOX BOUNDS pp pp pp
    -2.5000000000000000e+01 2.5000000000000000e+01
    -2.5000000000000000e+01 2.5000000000000000e+01
    -7.5000000000000000e+01 7.5000000000000000e+01
    ITEM: ATOMS id x y z c_1[1] c_1[2] c_1[3]
    2354 18.8358 -21.02 -70.5731 -21041.8 -3738.18 -2520.84
    1708 5.54312 -8.1526 -62.6984 4362.84 -30610.2 -4065.84

The last two lines are what we need for processing.

LATEST CODE

    def time_average():
        try:
            filename = mem.pressurefile
            navg = mem.NFRAMES
            dz = mem.dz
            zlo = mem.zlo
            NZ = mem.NZ
            mass = mem.mass

            dens_fact = amu_to_kg / (mem.slab_V * ang3_to_m3)

            array_pxx = np.zeros([NZ,1])
            array_pyy = np.zeros([NZ,1])
            array_pzz = np.zeros([NZ,1])
            array_ndens = np.zeros([NZ,1])

            #array_density = np.zeros([NZ,1])
            array_enthalpy = np.zeros([NZ,1])
            array_surf_tens = np.zeros([NZ,1])

            counter = 0
            locList = []
            pxxList = []
            pyyList = []
            pzzList = []
            with open(filename) as f:
                for line in f:
                    #line.strip("\n")
                    #content = [_ for _ in line.split()]
                    content = line.split()
                    if len(content) == 7:
                        z = float(content[3]) - zlo
                        pxx = float(content[4])
                        pyy = float(content[5])
                        pzz = float(content[6])

                        #loc = math.floor(z/dz)
                        loc = int(z // dz)

                        if loc >= NZ:
                            loc = loc - NZ
                        elif loc < 0:
                            loc = loc + NZ
                        #print(z, loc, zlo)

                        # Not great but much faster than using Numpy functions
                        locList.append(loc)
                        pxxList.append(pxx)
                        pyyList.append(pyy)
                        pzzList.append(pzz)
                    counter += 1

            # Very fast list-to-Numpy-array conversion
            locList = np.array(locList, dtype=np.int32)
            pxxList = np.array(pxxList, dtype=np.float64)
            pyyList = np.array(pyyList, dtype=np.float64)
            pzzList = np.array(pzzList, dtype=np.float64)

            # Fast accumulate
            np.add.at(array_pxx[:,0], locList, pxxList)
            np.add.at(array_pyy[:,0], locList, pyyList)
            np.add.at(array_pzz[:,0], locList, pzzList)
            np.add.at(array_ndens[:,0], locList, 1)

            array_pxx /= navg
            array_pyy /= navg
            array_pzz /= navg
            array_ndens /= navg
            array_density = mass * dens_fact * array_ndens

            return (array_density, array_enthalpy, array_surf_tens)
        except IndexError as err:
            writelog(err)
            print(loc)
            writelog(float(content[3]), loc, zlo)

Testing computer specs:
Intel® Xeon(R) W-2255 CPU @ 3.70GHz × 20
RAM: 16 GB
NVIDIA Corporation GP107GL [Quadro P620]
64-bit Ubuntu 20.04.3 LTS

The current average execution time is ~2.6 s (3x faster than the original), credit to user @JeromeRichard.

CodePudding user response:

First of all, Python is clearly not the best tool for doing such a computation efficiently. The code is sequential, and most of the time is spent in CPython interpreter operations or Numpy internal functions.

Option 1: cython3
This did not yield any performance improvements.

This is partially because optimizations are not enabled. You need to use the flag -O2 or even -O3. Still, Cython will probably not help a lot as most of the time is spent in CPython-to-Numpy calls in this specific code.
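
For example, the gcc command from the question could simply request a higher optimization level (same paths and Python version as shown above):

    cython3 --embed -o ptythinfile.c ptythinfile.py
    gcc -O3 -I /usr/include/python3.8 -o ptythinfile ptythinfile.c -lpython3.8 -lpthread -lm -lutil -ldl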

Option 2: C/C++ — Converting the entire code to C/C++ and compiling it. In fact, my first code was in C++, but debugging was a nightmare, so I switched to Python. So, I don't want to follow this route.

You do not need to port all the code. You can rewrite only the performance-critical functions like this one and put them in a dedicated CPython module (i.e. write a C/C++ extension). However, this solution requires dealing with low-level CPython internals. Cython may help with that: AFAIK, you can use Cython to call a C function from a Cython function, and Cython makes it easy to build the interface between CPython and the C functions. A simple function interface should help keep the code easy to read and maintain. Still, I agree this is not great, but C code can do this computation at least an order of magnitude faster than CPython...
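
As a rough sketch of this route (the header, file, and function names are hypothetical), a Cython wrapper around a C routine could look like this; the C side would do the parsing and binning, and the wrapper only moves the results back into Numpy arrays:

    # pressure_ext.pyx -- hypothetical sketch of a Cython wrapper around a C routine
    import numpy as np

    cdef extern from "accumulate.h":
        void accumulate_file(const char* path, double* pxx, double* pyy,
                             double* pzz, double* ndens,
                             int NZ, double dz, double zlo)

    def time_average_c(str path, int NZ, double dz, double zlo):
        # Allocate the output bins as C-contiguous memoryviews.
        cdef double[::1] pxx = np.zeros(NZ)
        cdef double[::1] pyy = np.zeros(NZ)
        cdef double[::1] pzz = np.zeros(NZ)
        cdef double[::1] ndens = np.zeros(NZ)
        accumulate_file(path.encode(), &pxx[0], &pyy[0], &pzz[0], &ndens[0],
                        NZ, dz, zlo)
        return np.asarray(pxx), np.asarray(pyy), np.asarray(pzz), np.asarray(ndens)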

Searching Google, I found many options like Shed Skin, etc.

Shed Skin is not actively developed anymore. I doubt such a project can help in your case because the code is pretty complex and uses Numpy.

Numba could theoretically help a lot in this case. However, strings are not well supported yet (i.e. the parsing part).
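
To illustrate the limitation: the numeric binning below would compile fine with Numba (a hypothetical sketch, not benchmarked here), but the split()/float() parsing would still have to run in plain CPython first, which is where most of the time goes:

    import numba as nb
    import numpy as np

    @nb.njit
    def bin_pressures(z, pxx, pyy, pzz, NZ, dz, zlo):
        # Same binning logic as the question, applied to already-parsed arrays.
        acc_pxx = np.zeros(NZ)
        acc_pyy = np.zeros(NZ)
        acc_pzz = np.zeros(NZ)
        acc_ndens = np.zeros(NZ)
        for i in range(z.shape[0]):
            loc = int((z[i] - zlo) // dz)
            if loc >= NZ:
                loc -= NZ
            elif loc < 0:
                loc += NZ
            acc_pxx[loc] += pxx[i]
            acc_pyy[loc] += pyy[i]
            acc_pzz[loc] += pzz[i]
            acc_ndens[loc] += 1.0
        return acc_pxx, acc_pyy, acc_pzz, acc_ndens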

Could you point out the best way to optimize the above code snippet and possible alternative solutions to speed it up?

Lines like array_pxx[loc] += pxx are very slow because the interpreter needs to call a C Numpy function internally that performs a lot of unneeded operations: bound/type checking, type conversions, allocations/deallocations, reference counting, etc. Such an operation is very slow (>1000 times slower than in C++). One solution to avoid this is simply to use pure-Python lists in pure-Python loops (at least when the code cannot be efficiently vectorized). You can convert the lists to Numpy arrays efficiently and perform the accumulation with np.add.at. Here is an improved implementation:

    def time_average():
        try:
            filename = mem.pressurefile
            navg = mem.NFRAMES
            dz = mem.dz
            zlo = mem.zlo
            NZ = mem.NZ
            mass = mem.mass

            dens_fact = amu_to_kg / (mem.slab_V * ang3_to_m3)

            array_pxx = np.zeros([NZ,1])
            array_pyy = np.zeros([NZ,1])
            array_pzz = np.zeros([NZ,1])
            array_ndens = np.zeros([NZ,1])

            #array_density = np.zeros([NZ,1])
            array_enthalpy = np.zeros([NZ,1])
            array_surf_tens = np.zeros([NZ,1])

            counter = 0
            locList = []
            pxxList = []
            pyyList = []
            pzzList = []
            with open(filename) as f:
                for line in f:
                    #line.strip("\n")
                    #content = [_ for _ in line.split()]
                    content = line.split()
                    if len(content) == 7:
                        z = float(content[3]) - zlo
                        pxx = float(content[4])
                        pyy = float(content[5])
                        pzz = float(content[6])

                        #loc = math.floor(z/dz)
                        loc = int(z // dz)

                        if loc >= NZ:
                            loc = loc - NZ
                        elif loc < 0:
                            loc = loc + NZ
                        #print(z, loc, zlo)

                        # Not great but much faster than using Numpy functions
                        locList.append(loc)
                        pxxList.append(pxx)
                        pyyList.append(pyy)
                        pzzList.append(pzz)
                    counter += 1

            # Very fast list-to-Numpy-array conversion
            locList = np.array(locList, dtype=np.int32)
            pxxList = np.array(pxxList, dtype=np.float64)
            pyyList = np.array(pyyList, dtype=np.float64)
            pzzList = np.array(pzzList, dtype=np.float64)

            # Fast accumulate
            np.add.at(array_pxx[:,0], locList, pxxList)
            np.add.at(array_pyy[:,0], locList, pyyList)
            np.add.at(array_pzz[:,0], locList, pzzList)
            np.add.at(array_ndens[:,0], locList, 1)

            array_pxx /= navg
            array_pyy /= navg
            array_pzz /= navg
            array_ndens /= navg
            array_density = mass * dens_fact * array_ndens

            return (array_density, array_enthalpy, array_surf_tens)
        except IndexError as err:
            writelog(err)
            print(loc)
            writelog(float(content[3]), loc, zlo)

This code is about 3 times faster overall on my machine. Note however that it should take more memory (due to the lists).

Most of the remaining time is spent in string conversions (25%), string splitting (20-25%), list appending (17%), and the CPython interpreter itself, e.g. importing modules (20%). The I/O operations take only a tiny fraction of the overall time (on an SSD, or when the file is cached by the operating system). Optimizing this further is challenging as long as pure-Python code is used (with CPython).

CodePudding user response:

The first step, reading the file, can easily be done with genfromtxt. It reads the file line by line, splits each line (as you do), collects the results in a list of lists, and makes the array at the end. pandas.read_csv is faster, at least when using the C engine, and for large files it may be worth a try.

With dtype=None, genfromtxt makes a structured array that preserves the integer nature of the first column. Access to 'columns' is by field name (as specified in the dtype):

In [30]: data = np.genfromtxt('stack69665939.py',skip_header=9, dtype=None)
In [31]: data
Out[31]: 
array([(2354, 18.8358 , -21.02  , -70.5731, -21041.8 ,  -3738.18, -2520.84),
       (1708,  5.54312,  -8.1526, -62.6984,   4362.84, -30610.2 , -4065.84)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8')])

Or, loading all values as floats, making an (N,7) 2d array:

In [32]: data = np.genfromtxt('stack69665939.py',skip_header=9)
In [33]: data
Out[33]: 
array([[ 2.35400e+03,  1.88358e+01, -2.10200e+01, -7.05731e+01,
        -2.10418e+04, -3.73818e+03, -2.52084e+03],
       [ 1.70800e+03,  5.54312e+00, -8.15260e+00, -6.26984e+01,
         4.36284e+03, -3.06102e+04, -4.06584e+03]])

Specifying usecols to be just [3,4,5,6] might save some time. You seem to be just interested in this data:

In [35]: z = data[:,3]
In [36]: pxyz = data[:,[4,5,6]]
In [37]: z
Out[37]: array([-70.5731, -62.6984])
In [38]: pxyz
Out[38]: 
array([[-21041.8 ,  -3738.18,  -2520.84],
       [  4362.84, -30610.2 ,  -4065.84]])

It appears then that you do something with z to derive a loc, and use that to combine 'rows' of the pxyz array. I won't try to recreate that.

Anyway, when handling large CSV-like files, we usually read in one step and then process the resulting array or dataframe afterwards. Processing while reading is possible, but usually not worth the effort.
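
As a rough end-to-end sketch of that read-then-process pattern (assuming the ITEM: header lines are filtered out or the file holds a single frame; a real LAMMPS dump repeats them every timestep), using pandas with the C engine and the np.add.at accumulation from the other answer:

    import numpy as np
    import pandas as pd

    # Read only the columns of interest; skiprows=9 matches the sample above.
    df = pd.read_csv(mem.pressurefile, delim_whitespace=True, header=None,
                     skiprows=9, usecols=[3, 4, 5, 6], engine="c")
    z = df[3].to_numpy() - mem.zlo
    pxx = df[4].to_numpy()

    # Vectorized binning; the modulo reproduces the wrap-around of the
    # question's if/elif for values at most one period out of range.
    loc = (z // mem.dz).astype(np.int64) % mem.NZ
    array_pxx = np.zeros(mem.NZ)
    np.add.at(array_pxx, loc, pxx)          # likewise for pyy, pzz, ndens
    array_pxx /= mem.NFRAMES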
