How can I create a 2D int array from a .txt file (with strings, tabs and symbols)?

I want to create a 2D array A[][] in Python from a txt file.

My txt file looks like this:

A=[ -5  1   7   9   -1  -2  6   -2
    3   -3  5   1   1   -1  7   8
    4   8   -6  1   -1  2   4   -6
    1   2   -1  -1  12  6   1   8
    2   -9  15  11  9   -1  -1  -1
    3   -9  1   1   -2  1   5   9]

The numbers are tab-delimited.

Can anyone help me store this in a 2D array? Also, what if the numbers were not only ints but included floats as well?

CodePudding user response：

All of the methods below use filedata to represent the contents of your file. The tabs are written explicitly as \t because my editor converts tabs to spaces in Python files. The trailing newline and the call to strip() were added on purpose, to show that a trailing newline is something you have to handle. One value was changed to a float for testing. Otherwise, it is a copy of what you posted. Each method reassigns filedata, so rerun this definition before trying the next one. All timeit results are for 10000 iterations.

filedata = ('A=[\t-5.6\t1\t7\t9\t-1\t-2\t6\t-2\n'
            '\t3\t-3\t5\t1\t1\t-1\t7\t8\n'
            '\t4\t8\t-6\t1\t-1\t2\t4\t-6\n'
            '\t1\t2\t-1\t-1\t12\t6\t1\t8\n'
            '\t2\t-9\t15\t11\t9\t-1\t-1\t-1\n'
            '\t3\t-9\t1\t1\t-2\t1\t5\t9]\n').strip()
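In practice filedata would come from the file itself rather than a literal. A minimal sketch, assuming the file is small enough to read in one go. The helper name load_filedata and the file name matrix.txt are made up for illustration:

```python
from pathlib import Path

def load_filedata(path):
    """Return the whole file as one string, without leading/trailing whitespace."""
    # read_text() opens the file, reads everything and closes it again
    return Path(path).read_text(encoding='utf-8').strip()

# 'matrix.txt' is a hypothetical file name, replace it with your own
# filedata = load_filedata('matrix.txt')
```

The strip() call plays the same role as in the literal above: it drops the trailing newline that most editors add at the end of a file.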

First Method

Use regex to parse the data (timeit: 1.4896928800153546)

import re

d  = re.compile(r'-?\d+(\.\d*)?') #int/float regex
ch = (int, float)                 #choice

out = [] #for results

#get rid of the name, which could have numbers in it that would break this technique
filedata = filedata.split('[')[1].strip()

#iterate over lines
for line in filedata.split('\n'):
    
    #get all numbers in this line as str
    t = [m.group() for m in d.finditer(line)]
        
    #format str to float or int based on the existence of a dot
    out.append([ch['.' in i](i) for i in t])
    
print(*out, sep='\n')

However, you can halve the number of passes with a well-placed walrus operator (:=, Python 3.8 or later). The version above goes over the numbers in each line twice: once to collect them, and again to convert them. The version below does both in a single pass, although it is actually slower.

timeit: 1.6025475490023382

#iterate over lines
for line in filedata.split('\n'):
    
    #everything in one ~ half as many iterations as the above version
    out.append([ch['.' in (i:=m.group())](i) for m in d.finditer(line)])
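To check the timing difference on your own machine, both variants can be wrapped in functions and passed to timeit. This is only a sketch: the function names two_pass, one_pass and compare are made up, and sample is a small stand-in for the list body after the name and '[' have been removed. Absolute numbers will differ from the ones quoted here.

```python
import re
import timeit

d  = re.compile(r'-?\d+(\.\d*)?')  # int/float regex
ch = (int, float)                  # indexed by a bool: False -> int, True -> float

# small stand-in for the list body (assumption: name and '[' already removed)
sample = '-5.6\t1\t7\n\t3\t-3\t5]'

def two_pass(data):
    out = []
    for line in data.split('\n'):
        # first pass collects the number strings, second pass converts them
        t = [m.group() for m in d.finditer(line)]
        out.append([ch['.' in i](i) for i in t])
    return out

def one_pass(data):
    out = []
    for line in data.split('\n'):
        # the walrus stores the match text so it can be tested and converted at once
        out.append([ch['.' in (i := m.group())](i) for m in d.finditer(line)])
    return out

def compare(number=10000):
    """Return (two_pass_seconds, one_pass_seconds) for `number` runs each."""
    return (timeit.timeit(lambda: two_pass(sample), number=number),
            timeit.timeit(lambda: one_pass(sample), number=number))

if __name__ == '__main__':
    print(compare())
```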

Second Method

Reformat the data to JSON and load (timeit: 2.2486417230102234)

import re, json

#get rid of name, and make sure we don't have a trailing new line
filedata = filedata.split('=')[1].strip()
#replace new lines with brackets
filedata = filedata.replace('\n', '],[')
#replace Num Whitespace with Num Comma
filedata = re.compile(r'(\d)\s').sub('\\1,', filedata)
#wrap
filedata = '[' + filedata + ']'
#load as json
out = json.loads(filedata)

print(*out, sep='\n')

Third Method

One character at a time (timeit: 0.997349611017853)

The conditions are written in the order in which things happen, which is hopefully easier to follow. It is not the fastest order. For speed, move the first if (the '[' check) to the end and adjust the if/elif keywords accordingly. Most characters are part of a number, so that test should come first. Initializing the result container happens only once, so it should come last. With that change timeit drops to 0.9182042190223001.

out = None  
num = []     
ch  = (int, float)

#iterate over every character individually 
for c in filedata:
    #initiate result container
    if c == '[':
        out = [[]]
    #store number character
    elif c in '-.0123456789':
        num.append(c)
    elif (c in '\t\n]') and num:
        #format number, append to the last child, and reset num container
        i   = ''.join(num)
        out[-1].append(ch['.' in i](i))
        num = []

        #start a new child    
        if c == '\n':
            out.append([])
        
print(*out, sep='\n')
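The faster ordering described for this method could look roughly like the sketch below. It is wrapped in a function (parse_chars is a made-up name) so it does not depend on earlier variables, and it assumes the variable name in front of '[' contains no digits, '-' or '.'.

```python
def parse_chars(filedata):
    out = None
    num = []
    ch  = (int, float)

    for c in filedata:
        # most characters belong to a number, so test for that first
        if c in '-.0123456789':
            num.append(c)
        # a delimiter ends the number collected so far
        elif (c in '\t\n]') and num:
            i = ''.join(num)
            out[-1].append(ch['.' in i](i))
            num = []

            # a newline also starts a new row
            if c == '\n':
                out.append([])
        # creating the result container happens only once, so test it last
        elif c == '[':
            out = [[]]

    return out

print(*parse_chars('A=[\t-5.6\t1\n\t3\t-3]'), sep='\n')
```

The three conditions never overlap, so reordering them changes only how many comparisons are made per character, not the result.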

Fourth Method

String splitting (timeit: 0.5249546770355664)

This finishes the answer provided by @Pete. If you swap the uncommented line in the try block with the commented one below it, timeit drops to 0.4442364440183155.

import re

out = []  
ch  = (int, float)

try:
    #get only list guts
    filedata = re.compile(r'.+=\[(.*)\]', re.S).search(filedata).group(1)
    #filedata = filedata.split('[')[1].split(']')[0].strip()
except Exception as e: 
    print(e) #issues
else:
    for line in filedata.split('\n'):
        out.append([ch['.' in i](i) for i in line.split('\t') if i])
        
print(*out, sep='\n')

All methods produce the output below:

#[-5.6, 1, 7, 9, -1, -2, 6, -2]
#[3, -3, 5, 1, 1, -1, 7, 8]
#[4, 8, -6, 1, -1, 2, 4, -6]
#[1, 2, -1, -1, 12, 6, 1, 8]
#[2, -9, 15, 11, 9, -1, -1, -1]
#[3, -9, 1, 1, -2, 1, 5, 9]

CodePudding user response：

Iterate over your file line by line. Python lets you write:

for line in file:

Split each line on tabs:

elements = line.split("\t")

Loop through the elements, convert each one to a number, and add them to your array.
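Put together, those steps might look like the sketch below. The function name read_matrix is made up, and the handling of the 'A=[' prefix and the closing ']' is an assumption based on the file layout shown in the question.

```python
def read_matrix(path):
    rows = []
    with open(path, encoding='utf-8') as file:
        for line in file:
            # drop anything up to '[' and the closing ']' so only numbers and tabs remain
            line = line.split('[')[-1].replace(']', '')
            row = []
            for element in line.split('\t'):
                element = element.strip()
                if not element:
                    continue  # skip empty pieces from leading tabs or the newline
                # a dot means float, anything else is treated as int
                row.append(float(element) if '.' in element else int(element))
            if row:
                rows.append(row)
    return rows
```

Calling read_matrix on the file from the question would return a list of six rows with eight numbers each.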