I need to read data from several text files that have a random number of lines of text at the beginning. Typically the files look like:
file1.dat:
The file contains data
# this is a comment skip me
DataStart
index = integer
Some text
-5.0e-2 3.3 4.0
0 0.0e0 0.0e0
1.0 0.1 3.0
1.5 4.0 1.87
1.7 -4.67 0.124
...
...
15.3 -3.5e02 1.775
- At the beginning, file1.dat may contain several lines of text that could start with spaces, tabs, etc.
- The block of data I am interested in is always below those lines and has a fixed number of columns; in this case, it has 3 columns:
-5.0e-2 3.3 4.0
0 0.0e0 0.0e0
1.0 0.1 3.0
1.5 4.0 1.87
1.7 -4.67 0.124
...
...
15.3 -3.5e02 1.775
The lines containing the data may have spaces/tabs at the start of each line.
I have tried the following code:
import numpy as np
pattern = r'^[-0-9 ]*'
mydata = np.fromregex('file1.dat', pattern, dtype=float)
But when I run it I get:
~/.local/lib/python3.8/site-packages/numpy/lib/npyio.py in fromregex(file, regexp, dtype, encoding)
1530 # Create the new array as a single data-type and then
1531 # re-interpret as a single-field structured array.
-> 1532 newdtype = np.dtype(dtype[dtype.names[0]])
1533 output = np.array(seq, dtype=newdtype)
1534 output.dtype = dtype
TypeError: 'NoneType' object is not subscriptable
Your help is very much appreciated
CodePudding user response:
I think your regex needs to look more like this:
pattern = r'\s*([-+0-9e.]+)\s+([-+0-9e.]+)\s+([-+0-9e.]+).*'
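A quick sanity check of that pattern (the sample lines below are taken from the question's file; the header line is one of its comment lines):

```python
import re

# The suggested pattern: three capture groups, one per numeric column.
pattern = r'\s*([-+0-9e.]+)\s+([-+0-9e.]+)\s+([-+0-9e.]+).*'

data_line = '  1.7 -4.67 0.124'
header_line = '# this is a comment skip me'

# A data row matches and the three columns come out as groups.
m = re.match(pattern, data_line)
print(m.groups())  # ('1.7', '-4.67', '0.124')

# A header line does not start with characters from the class, so it fails.
print(re.match(pattern, header_line))  # None
```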
CodePudding user response:
In [603]: txt="""-5.0e-2 3.3 4.0
...: 0 0.0e0 0.0e0
...: 1.0 0.1 3.0
...: 1.5 4.0 1.87
...: 1.7 -4.67 0.124
...: 15.3 -3.5e02 1.775"""
The number layout is regular enough for the standard text reader:
In [604]: np.genfromtxt(txt.splitlines())
Out[604]:
array([[-5.000e-02,  3.300e+00,  4.000e+00],
       [ 0.000e+00,  0.000e+00,  0.000e+00],
       [ 1.000e+00,  1.000e-01,  3.000e+00],
       [ 1.500e+00,  4.000e+00,  1.870e+00],
       [ 1.700e+00, -4.670e+00,  1.240e-01],
       [ 1.530e+01, -3.500e+02,  1.775e+00]])
or even a plain line split:
In [605]: alist=[]
...: for line in txt.splitlines():
...: alist.append(line.split())
...:
In [606]: alist
Out[606]:
[['-5.0e-2', '3.3', '4.0'],
['0', '0.0e0', '0.0e0'],
['1.0', '0.1', '3.0'],
['1.5', '4.0', '1.87'],
['1.7', '-4.67', '0.124'],
['15.3', '-3.5e02', '1.775']]
In [607]: np.array(alist, float)
Out[607]:
array([[-5.000e-02,  3.300e+00,  4.000e+00],
       [ 0.000e+00,  0.000e+00,  0.000e+00],
       [ 1.000e+00,  1.000e-01,  3.000e+00],
       [ 1.500e+00,  4.000e+00,  1.870e+00],
       [ 1.700e+00, -4.670e+00,  1.240e-01],
       [ 1.530e+01, -3.500e+02,  1.775e+00]])
CodePudding user response:
To match a floating-point number, we can use the following regex (see this answer for details):
[+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?
You need to wrap it in a capture group () to extract the tokens from each line:
import numpy as np

# zero or more whitespace characters
opt_whitespace = r'\s*'
# the number token, in a capture group
number = r'([+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?)'
# one or more whitespace characters
whitespace = r'\s+'
# number of data columns
N = 3
# the full regex: N numbers separated by whitespace
pattern = opt_whitespace + number + (whitespace + number) * (N - 1) + opt_whitespace + r'\n'

data = np.fromregex('file1.dat', pattern, dtype=float)
print(data)
Output:
[[-5.000e-02  3.300e+00  4.000e+00]
 [ 0.000e+00  0.000e+00  0.000e+00]
 [ 1.000e+00  1.000e-01  3.000e+00]
 [ 1.500e+00  4.000e+00  1.870e+00]
 [ 1.700e+00 -4.670e+00  1.240e-01]
 [ 1.530e+01 -3.500e+02  1.775e+00]]