I need to read data from several text files that have a random number of lines of text at the beginning. Typically the files look like:
file1.dat:
The file contains data
# this is a comment skip me
DataStart
index = integer
Some text
-5.0e-2 3.3 4.0
0 0.0e0 0.0e0
1.0 0.1 3.0
1.5 4.0 1.87
1.7 -4.67 0.124
...
...
15.3 -3.5e02 1.775
- At the beginning, file1.dat may contain several lines of text that could start with spaces, tabs, etc.
- The block of data I am interested in is always below those lines and has a fixed number of columns; in this case, it has 3 columns:
-5.0e-2 3.3 4.0
0 0.0e0 0.0e0
1.0 0.1 3.0
1.5 4.0 1.87
1.7 -4.67 0.124
...
...
15.3 -3.5e02 1.775
The lines containing the data may have spaces/tabs at the start of each line.
I have tried the following code:
import numpy as np
pattern = r'^[-0-9 ]*'
mydata = np.fromregex('file1.dat', pattern, dtype=float)
But when I run it I get:
~/.local/lib/python3.8/site-packages/numpy/lib/npyio.py in fromregex(file, regexp, dtype, encoding)
1530 # Create the new array as a single data-type and then
1531 # re-interpret as a single-field structured array.
-> 1532 newdtype = np.dtype(dtype[dtype.names[0]])
1533 output = np.array(seq, dtype=newdtype)
1534 output.dtype = dtype
TypeError: 'NoneType' object is not subscriptable
Your help is very much appreciated
CodePudding user response:
I think your regex needs to look more like this:
pattern = r'\s*([-+0-9e.]+)\s+([-+0-9e.]+)\s+([-+0-9e.]+).*'
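A quick sanity check of that pattern (the sample lines below are taken from the question's file; the header line is one of its comment lines):

```python
import re

# The suggested pattern: three capture groups, one per numeric column.
pattern = r'\s*([-+0-9e.]+)\s+([-+0-9e.]+)\s+([-+0-9e.]+).*'

data_line = '  1.7 -4.67 0.124'
header_line = '# this is a comment skip me'

# A data row matches and the three columns come out as groups.
m = re.match(pattern, data_line)
print(m.groups())  # ('1.7', '-4.67', '0.124')

# A header line does not start with characters from the class, so it fails.
print(re.match(pattern, header_line))  # None
```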
CodePudding user response:
In [603]: txt="""-5.0e-2 3.3 4.0
...: 0 0.0e0 0.0e0
...: 1.0 0.1 3.0
...: 1.5 4.0 1.87
...: 1.7 -4.67 0.124
...: 15.3 -3.5e02 1.775"""
The number layout is regular enough for the standard text reader:
In [604]: np.genfromtxt(txt.splitlines())
Out[604]:
array([[-5.000e-02,  3.300e+00,  4.000e+00],
       [ 0.000e+00,  0.000e+00,  0.000e+00],
       [ 1.000e+00,  1.000e-01,  3.000e+00],
       [ 1.500e+00,  4.000e+00,  1.870e+00],
       [ 1.700e+00, -4.670e+00,  1.240e-01],
       [ 1.530e+01, -3.500e+02,  1.775e+00]])
or even a plain line split:
In [605]: alist=[]
...: for line in txt.splitlines():
...: alist.append(line.split())
...:
In [606]: alist
Out[606]:
[['-5.0e-2', '3.3', '4.0'],
['0', '0.0e0', '0.0e0'],
['1.0', '0.1', '3.0'],
['1.5', '4.0', '1.87'],
['1.7', '-4.67', '0.124'],
['15.3', '-3.5e02', '1.775']]
In [607]: np.array(alist, float)
Out[607]:
array([[-5.000e-02,  3.300e+00,  4.000e+00],
       [ 0.000e+00,  0.000e+00,  0.000e+00],
       [ 1.000e+00,  1.000e-01,  3.000e+00],
       [ 1.500e+00,  4.000e+00,  1.870e+00],
       [ 1.700e+00, -4.670e+00,  1.240e-01],
       [ 1.530e+01, -3.500e+02,  1.775e+00]])
CodePudding user response:
To match a floating-point number, we can use the following regex (see this answer for details):
[+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?
You need to wrap it in a capture group () to extract the tokens from each line:
import numpy as np

# zero or more whitespace characters
opt_whitespace = r'\s*'
# the number token, in a capture group
number = r'([+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?)'
# one or more whitespace characters
whitespace = r'\s+'
# number of data columns
N = 3
# the full regex: N numbers separated by whitespace
pattern = opt_whitespace + number + (whitespace + number) * (N - 1) + opt_whitespace + r'\n'

data = np.fromregex('file1.dat', pattern, dtype=float)
print(data)
Output:
[[-5.000e-02  3.300e+00  4.000e+00]
 [ 0.000e+00  0.000e+00  0.000e+00]
 [ 1.000e+00  1.000e-01  3.000e+00]
 [ 1.500e+00  4.000e+00  1.870e+00]
 [ 1.700e+00 -4.670e+00  1.240e-01]
 [ 1.530e+01 -3.500e+02  1.775e+00]]