List to numpy array in scientific notation

Time:11-02

I have data files that consist of a few lines of header and matrices of Nx4 size. I want to read this file starting from the matrix and save it to a variable as numpy array. These files are ~300 MB each, but an example file looks like this:

# Some header line
    Not all header lines start with a special character
# -- a keyword --
 7.3533498487067E-03 0.0000000000000E+00 1.5509636485369E-25-2.0531419826552E-27
 1.7232929428188E-25 1.3463226115772E-28 1.7232929428188E-25 1.3463226115772E-28
 4.4805616513289E-25 7.5394066248323E-26 6.7208424769933E-25 1.1093698319396E-25
-6.4623485355705E-25-1.1924016124944E-25-5.6007020641611E-25-5.6915788404426E-26

If the value is positive, there is a single space, but if it's negative, there's no space. So far I tried:

import numpy as np

matrix = []
with open('test.txt') as data:
    for line in data.readlines()[3:]: # I always know how many header lines should be skipped.
        matrix.append(line) # Saves all matrix elements into a list.
    matrix = ' '.join([i for item in matrix for i in item.split()]) # Combines all matrix elements into a single string with correct single space separation.
    matrix = np.fromstring(matrix, sep=' ') # This was supposed to convert the string into a 2D numpy array.

This code produces the following warning:

'DeprecationWarning: string or file could not be read to its end due to unmatched data; this will raise a ValueError in the future.'

I think it fails to read the scientific notation (this is probably wrong), but I don't know how to fix it. Also, I think I'm making it way longer than it should be, by converting it from list to str to numpy. How can I make this work with numpy? Pandas solutions are also appreciated.

Extra: I'd appreciate any solution that can get rid of header lines without creating/copying to any new files. But this is not essential.

CodePudding user response:

The format appears to be fixed-width: every number occupies exactly 20 characters, including the leading space or minus sign. You can exploit that by slicing each line at fixed offsets:

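import numpy as np

matrix = []
with open('test.txt') as data:
    for line in data.readlines()[3:]:  # skip the 3 header lines
        # Each value sits in its own 20-character column.
        matrix.append([float(line[i : i + 20]) for i in (0, 20, 40, 60)])

matrix = np.array(matrix)
print(matrix)

Output: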
[[ 7.35334985e-03  0.00000000e+00  1.55096365e-25 -2.05314198e-27]
 [ 1.72329294e-25  1.34632261e-28  1.72329294e-25  1.34632261e-28]
 [ 4.48056165e-25  7.53940662e-26  6.72084248e-25  1.10936983e-25]
 [-6.46234854e-25 -1.19240161e-25 -5.60070206e-25 -5.69157884e-26]]
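
If the column width cannot be relied on, another option is to pull the numbers out with a regular expression. Negative values that touch the previous number are then split correctly regardless of spacing. The following is only a sketch under assumptions: the helper name `read_matrix`, the pattern `FLOAT_RE`, and the in-memory sample are made up for illustration, and the pattern assumes every value is written as digits, a decimal point, digits, and an `E` exponent, as in the example file. It also streams the input line by line with `itertools.islice` rather than calling `readlines()`, which may help with ~300 MB files.

```python
import io
import re
from itertools import islice

# Matches one float in scientific notation with an optional leading sign,
# e.g. "7.3533498487067E-03" or "-2.0531419826552E-27".
# Assumption: all values in the file follow this shape.
FLOAT_RE = re.compile(r"[+-]?\d+\.\d+[Ee][+-]?\d+")


def read_matrix(lines, n_header=3):
    """Parse rows of floats, skipping the first n_header lines.

    `lines` is any iterable of strings (for example an open file object),
    so the input is consumed one line at a time.
    """
    rows = []
    for line in islice(lines, n_header, None):
        values = [float(tok) for tok in FLOAT_RE.findall(line)]
        if values:  # ignore blank lines
            rows.append(values)
    return rows


# Small in-memory stand-in for the data file (hypothetical sample).
sample = io.StringIO(
    "# Some header line\n"
    "    Not all header lines start with a special character\n"
    "# -- a keyword --\n"
    " 7.3533498487067E-03 0.0000000000000E+00 1.5509636485369E-25-2.0531419826552E-27\n"
    "-6.4623485355705E-25-1.1924016124944E-25-5.6007020641611E-25-5.6915788404426E-26\n"
)
rows = read_matrix(sample)
print(rows[0])
```

With a real file, `with open('test.txt') as data: rows = read_matrix(data)` followed by `np.array(rows)` should give the Nx4 array. Staying within NumPy, `np.genfromtxt` documents that an integer `delimiter` is treated as a fixed field width, so `np.genfromtxt('test.txt', skip_header=3, delimiter=20)` may also be worth trying; it has not been timed against files of this size here.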