How to load .txt file as a numpy array such that it only reads in certain lines?


I have a text file containing xyz coordinates broken up by lines of text (specifically, the first 2 lines are text, the next 22 are coordinates, the next 2 are text again, and so on for the rest of the file). I want to read the file in as a numpy array (or list, either works) that contains all the different sets of coordinates as separate lists/arrays.

So:

[[x1 y1 z1],[x2 y2 z2],...]

Here is what I have tried:

    def convert_xyz_bat(filename, newfile): #add self later
        with open(filename, "r") as f:
        coords = []
            for line in f:
                if "C" in line or "H" in line:
                    atom,x,y,z = line.split(" ")
                    coords.append([float(x), float(y), float(z)])
                else:
                    pass
            coordinates = np.array(coords, dtype=object)
        return print(coordinates[0])

This takes up a lot of memory since it writes all the lines to this variable (the file is really large). I'm not sure if this will use less memory or not, but I could also do something like this, where I make another file which contains all the coordinates:

with open(filename, "r") as f:
    with open(newfile, "w") as f1:
        for line in f:
            if "C" in line or "H" in line:
                atom, x, y, z = line.split(" ")
                f1.write(str([float(x), float(y), float(z)]))

The problem with making the file is that it only lets me write the coordinates as strings, so I would later have to open it again and parse everything back into an array (so that I can use indexing with loops later).
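One way to sidestep the string round-trip entirely is to store the parsed coordinates in NumPy's binary `.npy` format, which loads straight back as a float array. A minimal sketch of that idea (the file names and two-line sample data here are placeholders, not from the original file):

```python
import numpy as np

# Hypothetical two-line sample standing in for the real coordinate file.
sample = "C 1.0 2.0 3.0\nH 4.0 5.0 6.0\n"
with open("input.xyz", "w") as f:   # placeholder file name
    f.write(sample)

coords = []
with open("input.xyz") as f:
    for line in f:
        if "C" in line or "H" in line:
            atom, x, y, z = line.split()   # split() tolerates repeated spaces
            coords.append([float(x), float(y), float(z)])

# np.save writes a binary .npy file; np.load returns the float array
# directly, so no string parsing is needed when reading it back in.
np.save("coords.npy", np.array(coords))
restored = np.load("coords.npy")
```

Unlike a hand-written text file, the `.npy` round-trip preserves dtype and shape, so `restored` supports normal array indexing immediately.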

I am not sure which option would work better, or if there is a better third/fourth option that I have not considered.

CodePudding user response:

  1. You have some typos in your first code: return print() is an odd combination, and there is an indentation problem near the with statement.
  2. As mentioned, your second option will consume less memory, and the data will be readable on demand.

I think you need to rethink what your main target is. If you just want to convert the data between formats, file to file, the second option is better. If you need to apply some logic to the data, the first option (with its high memory consumption) is the solution. You can also do something else: instead of reading all the data at once, read it in chunks and work your way through the file. Something like:

import numpy as np

class ReadFile:
    def __init__(self, file_path):
        self.file_pipe = open(file_path, "r")
        self.number_of_lines_to_read = 1000

    def __del__(self):
        self.file_pipe.close()

    def get_next_cordinates(self):
        cnt = 0
        coords = []
        for line in self.file_pipe:
            cnt += 1
            if cnt % self.number_of_lines_to_read == 0:
                yield np.array(coords, dtype=object)
                coords = []
            if "C" in line or "H" in line:
                atom, x, y, z = line.split(" ")
                coords.append([float(x), float(y), float(z)])
        yield np.array(coords, dtype=object)

and then you can use it as follows:

read_file = ReadFile(file_path)
for coords in read_file.get_next_cordinates():
    # do something with the coords
    pass
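If the file really does repeat in fixed-size blocks (2 text lines followed by 22 coordinate lines, as the question describes), another option is to slice off exactly one block's worth of lines at a time and hand each chunk to np.loadtxt, which also accepts a list of strings. A sketch under that assumption (the block sizes are taken from the question; the column layout, atom label first, is assumed):

```python
from itertools import islice
import numpy as np

def iter_blocks(path, header_lines=2, coord_lines=22):
    """Yield one (coord_lines, 3) float array of xyz coordinates per block."""
    with open(path) as f:
        while True:
            # Consume and discard the text header of the next block.
            header = list(islice(f, header_lines))
            if not header:          # nothing left: end of file
                return
            # Grab exactly the coordinate lines for this block.
            chunk = list(islice(f, coord_lines))
            # loadtxt accepts a list of strings; column 0 is assumed to be
            # the atom label, columns 1-3 the x, y, z coordinates.
            yield np.loadtxt(chunk, usecols=(1, 2, 3))
```

Because islice only pulls lines as needed, memory usage stays at one block at a time, and each yielded array is already numeric, so no dtype=object is required.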