Faster alternative to pandas.read_csv for reading txt files with data

I am working on a project where I need to process a large number of txt files (about 7000), each containing 2 columns of floats and 12500 rows.

I am using pandas, and reading the files takes about 2 min 20 sec, which is a bit long. MATLAB takes about 1 min less. I would like to get close to MATLAB's time, or beat it.

Is there a faster alternative that I can implement in Python?

I tried Cython and the speed was the same as with pandas.

Here is the code I am using. It reads the files, which consist of column 1 (time) and column 2 (amplitude). I calculate the envelope of each file and build a list with the resulting envelopes for all files. I extract the time from the first file using the simple numpy.loadtxt(), which is slower than pandas, but that has no impact since it is just one file.

Any ideas?

For coding suggestions please try to use the same format as I use.

Cheers

Data example:

   1.7949600e-05  -5.3232106e-03
   1.7950000e-05  -5.6231098e-03
   1.7950400e-05  -5.9230090e-03
   1.7950800e-05  -6.3228746e-03
   1.7951200e-05  -6.2978830e-03
   1.7951600e-05  -6.6727570e-03
   1.7952000e-05  -6.5727906e-03
   1.7952400e-05  -6.9726562e-03
   1.7952800e-05  -7.0726226e-03
   1.7953200e-05  -7.2475638e-03
   1.7953600e-05  -7.1725890e-03
   1.7954000e-05  -6.9476646e-03
   1.7954400e-05  -6.6227738e-03
   1.7954800e-05  -6.4228410e-03
   1.7955200e-05  -5.8480342e-03
   1.7955600e-05  -6.1979166e-03
   1.7956000e-05  -5.7980510e-03
   1.7956400e-05  -5.6231098e-03
   1.7956800e-05  -5.3482022e-03
   1.7957200e-05  -5.1732611e-03
   1.7957600e-05  -4.6484375e-03

20 files here: https://1drv.ms/u/s!Ag-tHmG9aFpjcFZPqeTO12FWlMY?e=f6Zk38

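import os

import numpy as np
import pandas as pd
from scipy.signal import hilbert

# Raw string: in a normal string literal the "\t" in "D:\this" is read as a tab.
folder_tarjet = r"D:\this"

if len(folder_tarjet) > 0:
    print("You chose %s" % folder_tarjet)

# Sort the files by modification time.
list_of_files = os.listdir(folder_tarjet)
list_of_files.sort(key=lambda f: os.path.getmtime(os.path.join(folder_tarjet, f)))
num_files = len(list_of_files)

envs_a = []

for elem in list_of_files:
    file_name = os.path.join(folder_tarjet, elem)

    # Column 0 is time, column 1 is amplitude.
    amp = pd.read_csv(file_name, header=None, dtype=np.float64, sep=r"\s+")

    # Pass a 1-D array so that hilbert() runs along the samples of the signal.
    env_amplitudes = np.abs(hilbert(amp[1].to_numpy()))
    envs_a.append(env_amplitudes)

# One column per file.
envelopes = np.array(envs_a).T

# Time axis, taken from the first file.
file_name = os.path.join(folder_tarjet, list_of_files[0])
Time = np.loadtxt(file_name, usecols=0)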

CodePudding user response：

I would suggest not using CSV to store and load large amounts of data. If your data is already in CSV, you can still convert all of it at once into a faster format, for instance pickle, HDF5, Feather, or Parquet. These formats are not human-readable, but they perform much better on nearly every other metric.

After the conversion, you will be able to load the data in a few seconds (if not less than a second) instead of minutes, so in my opinion this is a much better solution than chasing a marginal 50% optimization.

If the data is still being generated, make sure to generate it in a compressed format. If you don't know which format to use, Parquet would be one of the fastest for reading and writing, and you can also load it from MATLAB.

The pandas I/O documentation lists the formats that pandas already supports.
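As a rough sketch of the one-time conversion described above: the function name `convert_folder` and the output file are made up for illustration, and pickle is used only because it needs no extra dependency. If pyarrow is installed, `to_parquet` / `pd.read_parquet` should be usable in the same place. The sketch assumes all amplitude columns fit in memory at once (7000 files of 12500 float64 values are about 700 MB).

```python
import glob
import os

import numpy as np
import pandas as pd


def convert_folder(folder, out_path):
    """Parse every *.txt file in `folder` once and store all amplitudes in one binary file."""
    paths = sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime)
    columns = {}
    for path in paths:
        # Column 0 is time, column 1 is amplitude.
        df = pd.read_csv(path, header=None, sep=r"\s+", dtype=np.float64)
        # String column names (the file names) keep the table compatible
        # with formats such as Parquet, which require them.
        columns[os.path.basename(path)] = df[1].to_numpy()
    table = pd.DataFrame(columns)
    table.to_pickle(out_path)
    return table
```

Later runs would then skip the text parsing entirely and load the table with `pd.read_pickle(out_path)`.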

CodePudding user response：

As discussed in @Ziur Olpa's answer and the comments, loading a binary format is bound to be faster than parsing text.

The quick way to get those gains is to use NumPy's own NPY format and have your reader function cache the parsed data on disk. That way, when you re-(re-re-)run your data analysis, it will use the pre-parsed NPY files instead of the "raw" TXT files. Should the TXT files change, you can just remove all NPY files and wait a while longer for parsing to happen (or maybe add logic that compares the modification times of the NPY files with those of their corresponding TXT files).

Something like this (I hope I got your hilbert/abs/transposition logic the way you wanted):

import glob
import os

import numpy as np
import pandas as pd
from scipy import signal


def read_scans(folder):
    """
    Read scan files from a given folder, yield tuples of filename/matrix.

    This will also create "cached" npy files for each txt file, e.g. a.txt -> a.txt.npy.
    """
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime):
        np_filename = filename + ".npy"
        if os.path.exists(np_filename):
            # Cache hit: load the pre-parsed binary file.
            yield (filename, np.load(np_filename))
        else:
            # Cache miss: parse the text file and save it for the next run.
            df = pd.read_csv(filename, header=None, dtype=np.float64, sep=r"\s+")
            mat = df.values
            np.save(np_filename, mat)
            yield (filename, mat)


def read_data_opt(folder):
    times = None
    envs = []
    for filename, mat in read_scans(folder):
        if times is None:
            # If we haven't got the times yet, grab them from this
            # matrix (by slicing the first column).
            times = mat[:, 0]
        env_amplitudes = signal.hilbert(mat[:, 1])
        envs.append(env_amplitudes)
    # Envelope = magnitude of the analytic signal; one column per file.
    envs = np.abs(np.array(envs)).T
    return (times, envs)

On my machine, the first run (for your 20-file dataset) takes

read_data_opt took 0.13302898406982422 seconds

and the subsequent runs are about 4x faster:

read_data_opt took 0.031115055084228516 seconds
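The modification-time check mentioned earlier could look roughly like the sketch below. The helper name `load_scan_cached` is hypothetical, and `np.loadtxt` is used for parsing only to keep the example dependency-free; the `pd.read_csv` call from `read_scans` could be used instead.

```python
import os

import numpy as np


def load_scan_cached(txt_path):
    """Load a scan, re-parsing the txt file only if its npy cache is missing or stale."""
    npy_path = txt_path + ".npy"
    cache_is_fresh = (
        os.path.exists(npy_path)
        and os.path.getmtime(npy_path) >= os.path.getmtime(txt_path)
    )
    if cache_is_fresh:
        return np.load(npy_path)
    # Cache missing or older than the txt file: parse the text and refresh the cache.
    mat = np.loadtxt(txt_path, dtype=np.float64)
    np.save(npy_path, mat)
    return mat
```

With this in place, editing or regenerating a TXT file is enough to trigger a re-parse of that one file, so the NPY files never need to be deleted by hand.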