Using np.loadtxt to Import Data from a Strangely-Formatted Text File

I am trying to import data from a text file that I've received.

The text file is somewhat large (400 MB). It is available from this link (https://drive.google.com/file/d/11CwId3feJRZGvP2OUAtixuZEFztrCP3W/view?usp=sharing). It may take a few minutes to download given its size.

The data in the file are in a format I've never encountered before. The delimiter between columns seems to be a semicolon followed by a space, and the data rows seem to be separated from each other by a blank line.

I've not been able to read in the data. The following is the Python code I'm using to try to import one column of string data and two columns of float data from the file:

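import numpy as np
f = 'summ.txt'
# column 4 holds the string ID (np.str no longer exists in NumPy 1.24+, so use the builtin str)
ID = np.loadtxt(f, dtype=str, unpack=True, usecols=[4], skiprows=8, delimiter='; ')
# columns 67 and 73 hold the two float values
hbeg, hend = np.loadtxt(f, unpack=True, usecols=[67, 73], skiprows=8, delimiter='; ')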

A solution/guidance would be wonderful.

CodePudding user response:

I would simply use the csv module to reformat the file:

import csv
import time

start = time.time()

# newline='' lets the csv module handle line endings itself
with open('summ.txt', newline='') as fin, open('output.txt', 'w', newline='') as fout:
    csv_reader = csv.reader(fin, delimiter=';')   # read semicolon-separated input
    csv_writer = csv.writer(fout, delimiter=',')  # write comma-separated output
    for row in csv_reader:
        if row:  # skip empty rows
            row = [x.strip() for x in row]  # remove surrounding spaces
            csv_writer.writerow(row)

end = time.time()

print('time:', end-start)

On my computer it took ~31 seconds.
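
Once the file has been rewritten with commas and without blank rows, np.loadtxt should be able to read the columns from the question. The following is only a minimal sketch: the in-memory sample stands in for output.txt and is made up for illustration, with column 0 as the ID and columns 1 and 2 as the floats. For the real file you would pass 'output.txt' and usecols=[4] and usecols=[67, 73], and possibly skiprows if header lines survive the conversion.

```python
import io
import numpy as np

# Hypothetical stand-in for output.txt: comma-separated, no blank rows.
sample = io.StringIO("A1,1.5,2.5\nB2,3.0,4.0\n")

# Read the string column on its own, since it needs dtype=str.
ids = np.loadtxt(sample, dtype=str, delimiter=",", usecols=[0])

sample.seek(0)  # rewind the in-memory file before reading it again

# Read the two float columns and unpack them into separate arrays.
hbeg, hend = np.loadtxt(sample, delimiter=",", usecols=[1, 2], unpack=True)

print(ids, hbeg, hend)
```

Reading the file twice keeps the string and float columns in separate arrays with the right dtypes, at the cost of a second pass over the data.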

But you can also keep the values in lists and convert them to a NumPy array or a pandas DataFrame:

import csv
import time

start = time.time()

IDs  = []
hbeg = []
hend = []

with open('summ.txt', newline='') as fin:
    csv_reader = csv.reader(fin,  delimiter=';')
    for row in csv_reader:
        if row:
            row = [x.strip() for x in row]
            if len(row) > 1:
                IDs.append(row[4])
                hbeg.append(row[67])  # column 67, matching usecols in the question
                hend.append(row[73])

end = time.time()

print('time:', end-start)

print(IDs[:10])
print(hbeg[:10])
print(hend[:10])
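
The lists built above still contain strings, so the numeric columns need an explicit conversion. A minimal sketch of the NumPy step, using made-up sample values in place of the real lists and assuming every collected hbeg/hend entry is a valid number (any header text left in the lists would raise a ValueError here):

```python
import numpy as np

# Hypothetical sample values in the shape the loop above produces:
# every field is still a string at this point.
IDs = ["A1", "B2"]
hbeg = ["1.5", "3.0"]
hend = ["2.5", "4.0"]

ids_arr = np.array(IDs)                 # stays an array of strings
hbeg_arr = np.array(hbeg, dtype=float)  # parses each string as a float
hend_arr = np.array(hend, dtype=float)

print(ids_arr, hbeg_arr, hend_arr)
```

If the file has header rows that pass the len(row) > 1 check, they would have to be skipped before this conversion, for example by slicing the lists.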