I am reading a .pksc file that contains coordinates and velocities of a large number of astronomical objects. I am doing the reading with
import numpy as np
f = open('halos_10x10.pksc')
data = np.fromfile(f, count=N*10, dtype=np.float32)
The file can be found here. It is very large, and I want to skip the first m objects (the first m lines corresponding to these objects, if the file has lines). How can I do this? I don't see an option for skipping anywhere. Optionally, it would also be nice to be able to skip the last k objects in the file. Thanks!
CodePudding user response:
The first thing to appreciate is that your PKSC file is binary and is one continuous string of bytes with no discernible breaks in the data.
Text files, on the other hand, have lines clearly delimited by some line-break character, so it's easy to read a line at a time, ignore M lines at the front, and then read only the remaining lines you care about: REMAINING_LINES = ALL_LINES - M_LINES - K_LINES.
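For contrast, here is a minimal sketch of that text-file approach (the file name 'objects.txt' and the counts are hypothetical; this does not apply to the binary .pksc file):

M_LINES = 5
K_LINES = 3
with open('objects.txt') as f_text:   # hypothetical plain-text file
    all_lines = f_text.readlines()
# drop the first M_LINES and the last K_LINES lines
remaining_lines = all_lines[M_LINES:len(all_lines) - K_LINES]

A PKSC file has no such line structure, so we have to count items instead.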
np.fromfile() reads the binary file an item at a time. To do that, it needs the dtype= param to tell the reader how big each item is. For a PKSC file we read each item as a 32-bit (4-byte) value, np.int32. (The halo data itself is float32, as in the question, but both types are 4 bytes per item, so reading everything as int32 works fine for counting, slicing, and copying items.)
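If you want the actual coordinate/velocity values, you can either read with dtype=np.float32 directly (as the question does) or reinterpret the bytes of an int32 array. A quick sketch:

import numpy as np

# both dtypes are 4 bytes per item, so item counts line up either way
assert np.dtype(np.int32).itemsize == np.dtype(np.float32).itemsize == 4

ints = np.array([1170060708], dtype=np.int32)   # value copied from the first record shown below
print(ints.view(np.float32))                    # same bytes, reinterpreted as float32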
I searched and searched but could not find a spec for the file. Fortunately, the link you provided has a sample Python script for reading the file, and I also found a well-documented Python library for dealing with these kinds of files (websky.py, linked below).
I've learned that the PKSC file has the following properties:
- The first 3 items are header items; the first header item is a count of the data records that follow the headers.
- Each data record contains 10 items.
np.fromfile() also takes the count= param, which tells it exactly how many items to read.
Here's how to read the 3 header items, get the total number of Halo records that follow, and read the first two records (10 items per record):
import numpy as np

Nitems_per_record = 10

f = open('halos_10x10.pksc', 'rb')   # binary file, so open it in binary mode
headers = np.fromfile(f, dtype=np.int32, count=3)
print(f'Headers: {headers}')
print(f'This file contains {headers[0]} records with Halo data')

record1 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
print(f'First record:\n{record1}')

record2 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
print(f'Second record:\n{record2}')
Headers: [2079516 2079516 2079516]
This file contains 2079516 records with Halo data
First record:
[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585 1086444250 1170064687 -1011110709 -1006510502]
Second record:
[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]
According to websky.py, the 2nd and 3rd header items also carry relevant values; you may care about these too. I've synthesized the following from that code:
RTHMAXin = headers[1]
redshiftbox = headers[2]
Reading more than one record at a time requires re-shaping the data. To read 3 records:
f = open('halos_10x10.pksc', 'rb')
np.fromfile(f, dtype=np.int32, count=3)   # read, but ignore, the header items

three_records = np.fromfile(f, dtype=np.int32, count=3 * Nitems_per_record)
print(f'Initial:\n{three_records}')

reshaped_records = np.reshape(three_records, (3, Nitems_per_record))
print(f'Re-shaped:\n{reshaped_records}')
Initial:
[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585
1086444250 1170064687 -1011110709 -1006510502 1170083367 -1013908122
-1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354
-1013985376 -1006510502 1169622353 -1009409432 -1006678295 -1045415727
-1017794908 -1051267742 1084874393 1169623221 -1009509109 -1006675510]
Re-shaped:
[[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585 1086444250 1170064687 -1011110709 -1006510502]
[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]
[ 1169622353 -1009409432 -1006678295 -1045415727 -1017794908 -1051267742 1084874393 1169623221 -1009509109 -1006675510]]
So, what about skipping?
Just trim the reshaped data
The easiest thing to do is read all the data and then trim what you don't want from the front and the back:
m = 1    # skip the first m records
k = -1   # negative index: drop the last record
trimmed_records = reshaped_records[m:k]
print(f'Trimmed:\n{trimmed_records}')
Trimmed:
[[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]]
I'm not sure why you want to skip, but that's the easiest to understand and implement.
If your concern is memory, read on.
Discard the first M records, read only the remaining R = A - M - K records
The next option, as I see it, is to:
- take the record count (A records) from the first header
- read and ignore M records
- figure out how many remaining records to read, given that you've already read M records and want to stop K records short of the end: R = A - M - K
You'll only save a little memory by ignoring the first M records; that data is still read and interpreted. You will definitely save memory by not reading the last K records at all:
f = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f, dtype=np.int32, count=3)
Arecords = headers[0]                       # total number of records in the file

Mrecords = 1_000_000                        # records to skip at the front
Krecords = 1_000_000                        # records to skip at the end

# read, but throw away, the first M records
Nitems = Mrecords * Nitems_per_record
np.fromfile(f, dtype=np.int32, count=Nitems)

Rrecords = Arecords - Mrecords - Krecords   # remaining records to read
Nitems = Rrecords * Nitems_per_record
data = np.fromfile(f, dtype=np.int32, count=Nitems)
data = np.reshape(data, (Rrecords, Nitems_per_record))
print(f'From {Arecords} to {Rrecords} records:\n{data.shape}')
From 2079516 to 79516 records:
(79516, 10)
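A further refinement, not shown above: since every item is 4 bytes, you can skip the first M records without reading them at all by seeking past them. A minimal sketch, assuming the same Nitems_per_record and file as above and a file opened in binary mode:

f = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f, dtype=np.int32, count=3)
Arecords = int(headers[0])

Mrecords = 1_000_000
Krecords = 1_000_000
Rrecords = Arecords - Mrecords - Krecords

item_size = np.dtype(np.int32).itemsize              # 4 bytes per item
f.seek(Mrecords * Nitems_per_record * item_size, 1)  # jump over the first M records

data = np.fromfile(f, dtype=np.int32, count=Rrecords * Nitems_per_record)
data = np.reshape(data, (Rrecords, Nitems_per_record))

In recent NumPy versions, np.fromfile() also accepts an offset= keyword (in bytes) that can achieve the same effect.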
CodePudding user response:
If all you need is to chunk the big file into smaller files, so that you can operate on them independently:
import numpy as np

Nrecords_per_chunk = 100_000
Nitems_per_record = 10

f_in = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f_in, dtype=np.int32, count=3)

Nitems = Nrecords_per_chunk * Nitems_per_record
fnumber = 1
while True:
    items = np.fromfile(f_in, dtype=np.int32, count=Nitems)

    # Near the end of the file, we're very likely to get back fewer items than we asked for
    Nrecords_read = items.shape[0] // Nitems_per_record

    # At End Of File: the chunk size happened to be a perfect multiple of the record count
    if Nrecords_read == 0:
        break

    records = np.reshape(items, (Nrecords_read, Nitems_per_record))

    with open(f'halos_{fnumber}.pksc', 'wb') as f_out:
        # Keep the same format by writing 3 "header" items, each set to this chunk's record count
        new_headers = np.array([Nrecords_read] * 3, dtype=np.int32)
        new_headers.tofile(f_out)
        records.tofile(f_out)

    # At End Of File
    if Nrecords_read < Nrecords_per_chunk:
        break

    fnumber += 1

f_in.close()
# Test that the first 100_000 records from the main file match the records in the first chunk file
f_in = open('halos_10x10.pksc', 'rb')
np.fromfile(f_in, dtype=np.int32, count=3)    # skip the 3 header items
Nitems = Nrecords_per_chunk * Nitems_per_record
items = np.fromfile(f_in, dtype=np.int32, count=Nitems)
records_orig = np.reshape(items, (Nrecords_per_chunk, Nitems_per_record))
f_in.close()

f_in = open('halos_1.pksc', 'rb')
np.fromfile(f_in, dtype=np.int32, count=3)    # skip the 3 header items
Nitems = Nrecords_per_chunk * Nitems_per_record
items = np.fromfile(f_in, dtype=np.int32, count=Nitems)
records_chunked = np.reshape(items, (Nrecords_per_chunk, Nitems_per_record))
f_in.close()

assert np.array_equal(records_orig, records_chunked)
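Once the file is chunked, each piece can be read with the same pattern as in the question, e.g. with the halo data as float32. A minimal sketch, assuming the chunk files produced above exist:

import numpy as np

Nitems_per_record = 10

with open('halos_1.pksc', 'rb') as f:
    n_records = np.fromfile(f, dtype=np.int32, count=3)[0]   # this chunk's record count
    halos = np.fromfile(f, dtype=np.float32, count=n_records * Nitems_per_record)
    halos = halos.reshape(n_records, Nitems_per_record)

print(halos.shape)   # (100000, 10) for a full chunk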