How can I format a file to a multidimensional numpy array for my AI

Time:04-21

I have a training set for my AI stored in data.txt. On each run the AI should get five X inputs and one solution/answer, which goes into the array y. The array X should look like this: [[x1,x2,x3,x4,x5],[x1,x2,...,x5],...]. I tested it with 2 * 5 inputs and got the following:

    [2.21600000e+05 2.02000000e+03 2.43738600e+06 1.09990343e+01
 9.11552347e-01 2.21600000e+05 2.02000000e+03 2.43738600e+06
 1.09990343e+01 9.11552347e-01 1.00000000e+01 1.00000000e+00
 5.72000000e+02 5.72000000e+01 1.00000000e+01]

What I want is the following:

[[221600,2020,2437386,10.999034296028881,0.9115523465703971],
 [10,1,572,57.2,10.0]]

The answer array y is fine. It is: [0.,0.]

The code:

import numpy as np
X=np.array([])
y=np.array([])
lineX=np.array([])
i=0
linenumber=0
with open('data.txt') as file:
    for line in file:
        dataline=line.rstrip()
        dataline=float(dataline)
        i+=1
        linenumber+=1

        if i != 6:
            lineX=np.append(lineX,dataline)
        else:
            X=np.append(X,lineX,axis=0)
            i=0
            y=np.append(y,dataline)
print(X)
print(y)

And the file (the original has about 800 lines, so I shortened it):

221600
2020
2437386
10.999034296028881
0.9115523465703971
0
10
1
572
57.2
10.0
0

The first five lines in the file are the inputs x1-x5, the sixth line is y (the answer), and the pattern then repeats.

How can I get it working?

CodePudding user response:

We will need two steps for this:

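import numpy as np

data = []
with open('data.txt') as file:
    for line in file:
        # one number per line: strip the trailing newline, then convert
        dataline = float(line.rstrip())
        data.append(dataline)
# turn the flat Python list into a 1-D array
data = np.array(data)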

First we put everything into a NumPy array. I would assume there are more efficient ways to read in the file, e.g. having pandas read it as a CSV, but for 800 values that shouldn't matter.
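
As one possible shortcut for the reading step, NumPy itself can parse a file with one number per line. This is only a sketch: the helper name load_flat_values is made up for illustration, and it assumes the file contains nothing but numbers.

```python
import numpy as np

def load_flat_values(path):
    """Read a text file with one number per line into a 1-D float array.

    Hypothetical helper, not part of the original answer.
    """
    # np.loadtxt parses whitespace-separated numbers; ndmin=1 keeps the
    # result one-dimensional even if the file holds a single value.
    return np.loadtxt(path, dtype=float, ndmin=1)
```

Calling data = load_flat_values('data.txt') should give the same flat array as the loop above, ready for the reshape in the second step.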

data = data.reshape(-1,6)
X = data[:,0:5]
y = data[:,5]

In the second step we reshape the array so that each row is one full sample of six values, then split it: columns 0-4 are your X values and column 5 is your y value.
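
Both steps can be wrapped in a small function that also checks the file length first, since reshape(-1,6) fails if the number of values is not a multiple of 6. This is a minimal sketch using the shortened file from the question; the name split_samples and the n_features parameter are assumptions for illustration.

```python
import numpy as np

def split_samples(values, n_features=5):
    """Split a flat sequence into X (n_samples, n_features) and y (n_samples,)."""
    values = np.asarray(values, dtype=float)
    width = n_features + 1  # five inputs followed by one answer
    if values.size % width != 0:
        raise ValueError(
            f"expected a multiple of {width} values, got {values.size}"
        )
    table = values.reshape(-1, width)  # one row per sample
    return table[:, :n_features], table[:, n_features]

# The twelve values from the shortened data.txt in the question.
flat = [221600, 2020, 2437386, 10.999034296028881, 0.9115523465703971, 0,
        10, 1, 572, 57.2, 10.0, 0]
X, y = split_samples(flat)
print(X.shape, y.shape)  # (2, 5) (2,)
```

The explicit check gives a readable error message if a line is missing from the file, instead of the generic reshape error.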

EDIT, Tangent on float values:

Integers are well defined in binary, e.g. 1101 is 13. Floats have a problem, though: you need to make a tradeoff between accuracy, as in decimal places, and min/max values, so you don't constantly run into overflows. So you have a fixed number of bits responsible for your decimal places and another fixed number for your exponent. You can read up on it here.
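
To make the fixed bit budget concrete, here is a small sketch that pulls a Python float apart into its bit fields. It assumes the usual 64-bit IEEE 754 double layout (1 sign bit, 11 exponent bits, 52 fraction bits), which is what CPython and NumPy's float64 use on common platforms; the helper name float_bits is made up for illustration.

```python
import struct

def float_bits(x):
    """Return the (sign, exponent, fraction) bit fields of a 64-bit float."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63                   # 1 bit
    exponent = (raw >> 52) & 0x7FF     # 11 bits, stored with a bias of 1023
    fraction = raw & ((1 << 52) - 1)   # 52 bits after the implicit leading 1
    return sign, exponent, fraction

sign, exponent, fraction = float_bits(13.0)
# 13 = 1.625 * 2**3, so the unbiased exponent is 3
print(sign, exponent - 1023, 1 + fraction / 2**52)  # 0 3 1.625

# 0.1 has no exact binary representation, so rounding shows up in sums
print(0.1 + 0.2 == 0.3)  # False
```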

This number in memory is always the same. What you are observing is its representation as a string when you print it. When the values in an array span a wide range, NumPy generally uses scientific notation, which is the same as format(x,'1.8e') for floats. If you want to print it differently, use a format string to format it however you like; for example, format(x,'1.1f') gives you the full number with a single decimal place.
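
A short sketch of the difference between the stored value and its printed form, using the X array from the question. The exact spacing of the array printout can differ between NumPy versions, so no output is shown for that line.

```python
import numpy as np

X = np.array([[221600, 2020, 2437386, 10.999034296028881, 0.9115523465703971],
              [10, 1, 572, 57.2, 10.0]])

# Formatting a single value: same number, two string representations.
print(format(X[0, 0], "1.8e"))  # 2.21600000e+05
print(format(X[1, 3], "1.1f"))  # 57.2

# Formatting a whole array in fixed-point instead of scientific notation.
fixed = np.array2string(X, suppress_small=True, precision=4)
print(fixed)

# The stored values are unchanged by any of this.
print(X[0, 0] == 221600)  # True
```

If you want every print(X) to behave this way, np.set_printoptions(suppress=True) changes the default for the whole session.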
