Convert string to multidimensional array in Python

I'm having a problem managing some data that are saved in a really awful format.

I have data for points that correspond to the edges of a polygon. The data for each polygon are separated by the string >, while the x and y values of each point are separated inconsistently: sometimes by several spaces, sometimes by spaces and a tab. I've tried to load the data into an array of arrays with the following code:

f = open('/Path/Data.lb','r')
data = f.read()
splat = data.split('>')

region = []

for number, polygon in enumerate(splat[1:len(splat)], 1):
    region.append(float(polygon))

But I keep getting an error from the float() call (I've truncated the message, as it's much longer):

ValueError: could not convert string to float: '\n     -73.311      -48.328\n     -73.311      -48.326\n     -73.318      -48.321\n     ...
...     -73.324\t  -48.353\n     -73.315\t  -48.344\n     -73.313\t  -48.337\n'

Is there a way to convert the data to float without modifying the source file? If not, is there a way to easily modify the source file so that all columns are separated the same way? I guess that way the same code should run smoothly.

Thanks!

CodePudding user response：

You can use regex to match decimal numbers.

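import re

PATH = "<path_to_file>"  # placeholder: replace with the actual path
coords = []
with open(PATH) as f:
    for line in f:
        # optional minus sign, digits, a decimal point, then more digits
        nums = re.findall(r'-?\d+\.\d+', line)
        if nums:
            coords.append(nums)
print(coords)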

Note: because the pattern requires a decimal point, this solution ignores the trailing 0 at the end of some lines. Be aware that the results in coords are still strings, so you'll need to convert them with float().
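As a minimal sketch of that conversion step, the snippet below runs a decimal-number pattern over a small in-memory sample and converts each match with float(). The sample text and the helper name parse_line are made up for illustration; they are not taken from the real file.

```python
import re

# Optional minus sign, digits, a decimal point, then more digits.
NUMBER = re.compile(r'-?\d+\.\d+')

def parse_line(line):
    """Return the decimal numbers found in one line as floats."""
    return [float(tok) for tok in NUMBER.findall(line)]

# Hypothetical sample mimicking the mixed space/tab layout from the question.
sample = ">\n     -73.311      -48.328\n     -73.324\t  -48.353 0\n"

coords = []
for line in sample.splitlines():
    nums = parse_line(line)
    if nums:  # skip the '>' separator and any blank lines
        coords.append(nums)

print(coords)
```

For this sample, coords ends up as [[-73.311, -48.328], [-73.324, -48.353]]: the bare 0 is dropped because it has no decimal point.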

CodePudding user response：

Try:

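with open("PataIce.lb", "r") as file:
    # drop a leading or trailing ">" and surrounding whitespace, then split into polygons
    polygons = file.read().strip(">").strip().split(">")

region = []
for polygon in polygons:
    sides = polygon.strip().split("\n")
    # keep only the first two numbers on each line
    points = [[float(num) for num in side.split()[:2]] for side in sides]
    region.append(points)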

Some of the lines contain more than two values, so I've restricted the script to read only the first two numbers on each line.
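To try the idea without touching the file, here is a rough sketch of the same approach as a function that works on a string. The name parse_polygons and the sample text are hypothetical, and it assumes each separator line contains nothing but the > character.

```python
def parse_polygons(text):
    """Split text on '>' and return one list of [x, y] pairs per polygon."""
    region = []
    for chunk in text.split(">"):
        points = []
        for line in chunk.splitlines():
            fields = line.split()  # splits on any run of spaces or tabs
            if len(fields) >= 2:   # ignore blank or incomplete lines
                points.append([float(fields[0]), float(fields[1])])
        if points:                 # drop empty chunks, e.g. before the first '>'
            region.append(points)
    return region

# Hypothetical sample: two polygons, mixed separators, one trailing extra value.
sample = (
    ">\n"
    "     -73.311      -48.328\n"
    "     -73.311      -48.326\n"
    ">\n"
    "     -73.324\t  -48.353 0\n"
    "     -73.315\t  -48.344\n"
)

region = parse_polygons(sample)
print(len(region))
print(region[1])
```

Skipping empty chunks and blank lines explicitly means the result does not depend on whether the file starts or ends with a > or a newline.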

CodePudding user response：

In [79]: astr = '\n     -73.311      -48.328\n     -73.311      -48.326\n     -73.318      -48.321\n  -73.324\t  -48.353\n     -73.315\t  -48.344\n     -73.313\t  -48.337\n'
In [80]: lines = astr.splitlines()
In [81]: lines
Out[81]: 
['',
 '     -73.311      -48.328',
 '     -73.311      -48.326',
 '     -73.318      -48.321',
 '  -73.324\t  -48.353',
 '     -73.315\t  -48.344',
 '     -73.313\t  -48.337']

splitlines deals with the \n separator; split() handles the tab and spaces.

In [82]: [line.split() for line in lines]
Out[82]: 
[[],
 ['-73.311', '-48.328'],
 ['-73.311', '-48.326'],
 ['-73.318', '-48.321'],
 ['-73.324', '-48.353'],
 ['-73.315', '-48.344'],
 ['-73.313', '-48.337']]

The initial [] needs to be removed one way or another:

In [84]: np.array(Out[82][1:], dtype=float)
Out[84]: 
array([[-73.311, -48.328],
       [-73.311, -48.326],
       [-73.318, -48.321],
       [-73.324, -48.353],
       [-73.315, -48.344],
       [-73.313, -48.337]])

This works only if each line has the same number of elements, here 2. As long as the lists of strings in Out[82] are clean enough, you can let np.array do the conversion from string to float.
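One possible way to drop the empty entry and check the row lengths up front, sketched in plain Python on a shortened copy of the string (the variable names rows, widths and points are illustrative only). The filtered rows list could equally be passed to np.array(rows, dtype=float) as above.

```python
astr = '\n     -73.311      -48.328\n  -73.324\t  -48.353\n'

# Skip empty or whitespace-only lines while splitting, so no [] is produced.
rows = [line.split() for line in astr.splitlines() if line.strip()]

# Collect the distinct row lengths; a rectangular array needs exactly one.
widths = {len(row) for row in rows}
if widths != {2}:
    raise ValueError(f"expected 2 values per line, found row lengths {sorted(widths)}")

# Convert the string pairs to floats.
points = [[float(x), float(y)] for x, y in rows]
print(points)
```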

Your actual file may require some further handling, but this should give you an idea of the basics.