I am reading a .pksc file that contains coordinates and velocities of a large number of astronomical objects. I am doing the reading with
import numpy as np
f = open('halos_10x10.pksc')
data = np.fromfile(f, count=N*10, dtype=np.float32)
The file can be found here. It is very large, and I want to skip the first m objects (the first m lines corresponding to these objects, if the file has lines). How can I do this? I don't see an option for skipping anywhere. Optionally, it would also be nice to be able to skip the last k objects in the file. Thanks!
CodePudding user response:
The first thing to appreciate is that your PKSC file is binary and is one continuous string of bytes with no discernible breaks in the data.
Text files, on the other hand, have lines clearly delimited by some line-break character, so it's easy to read a line at a time, ignore M lines at the front, and then read only the remaining lines you care about: REMAINING_LINES = ALL_LINES - M_LINES - K_LINES.
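For contrast, here is a minimal sketch of that text-file approach (the file name 'objects.txt' and the counts are hypothetical; this does not apply to the binary .pksc file):

M_LINES = 5
K_LINES = 3
with open('objects.txt') as f_text:   # hypothetical plain-text file
    all_lines = f_text.readlines()
# drop the first M_LINES and the last K_LINES lines
remaining_lines = all_lines[M_LINES:len(all_lines) - K_LINES]

A PKSC file has no such line structure, so we have to count items instead.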
np.fromfile() reads the binary file an item at a time. To do that, it needs the dtype= param to tell the reader how big each item is. For a PKSC file we read each item as a 32-bit (4-byte) value, np.int32. (The halo data itself is float32, as in the question, but both types are 4 bytes per item, so reading everything as int32 works fine for counting, slicing, and copying items.)
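If you want the actual coordinate/velocity values, you can either read with dtype=np.float32 directly (as the question does) or reinterpret the bytes of an int32 array. A quick sketch:

import numpy as np

# both dtypes are 4 bytes per item, so item counts line up either way
assert np.dtype(np.int32).itemsize == np.dtype(np.float32).itemsize == 4

ints = np.array([1170060708], dtype=np.int32)   # value copied from the first record shown below
print(ints.view(np.float32))                    # same bytes, reinterpreted as float32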
I searched and searched but could not find a spec for the file. Fortunately, the link you provided has a sample Python script for reading the file, and I also found a well-documented Python library for dealing with these kinds of files (websky.py, linked below).
I've learned that the PKSC file has the following properties:
- The first 3 items are header items; the first header item is a count of the data records that follow the headers.
- Each data record contains 10 items.
np.fromfile() also takes the count= param, which tells it exactly how many items to read.
Here's how to read the 3 header items, get the total number of Halo records that follow, and read the first two records (10 items per record):
import numpy as np

Nitems_per_record = 10

f = open('halos_10x10.pksc', 'rb')   # binary file, so open it in binary mode
headers = np.fromfile(f, dtype=np.int32, count=3)
print(f'Headers: {headers}')
print(f'This file contains {headers[0]} records with Halo data')

record1 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
print(f'First record:\n{record1}')

record2 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
print(f'Second record:\n{record2}')
Headers: [2079516 2079516 2079516]
This file contains 2079516 records with Halo data
First record:
[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585 1086444250 1170064687 -1011110709 -1006510502]
Second record:
[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]
According to websky.py, the 2nd and 3rd header items also carry relevant values; you may care about these too. I've synthesized the following from that code:
RTHMAXin = headers[1]
redshiftbox = headers[2]
Reading more than one record at a time requires re-shaping the data. To read 3 records:
f = open('halos_10x10.pksc', 'rb')
np.fromfile(f, dtype=np.int32, count=3)   # read, but ignore, the header items

three_records = np.fromfile(f, dtype=np.int32, count=3 * Nitems_per_record)
print(f'Initial:\n{three_records}')

reshaped_records = np.reshape(three_records, (3, Nitems_per_record))
print(f'Re-shaped:\n{reshaped_records}')
Initial:
[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585
1086444250 1170064687 -1011110709 -1006510502 1170083367 -1013908122
-1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354
-1013985376 -1006510502 1169622353 -1009409432 -1006678295 -1045415727
-1017794908 -1051267742 1084874393 1169623221 -1009509109 -1006675510]
Re-shaped:
[[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585 1086444250 1170064687 -1011110709 -1006510502]
[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]
[ 1169622353 -1009409432 -1006678295 -1045415727 -1017794908 -1051267742 1084874393 1169623221 -1009509109 -1006675510]]
So, what about skipping?
Just trim the reshaped data
The easiest thing to do is read all the data and then trim what you don't want from the front and the back:
m = 1    # skip the first m records
k = -1   # negative index: drop the last record
trimmed_records = reshaped_records[m:k]
print(f'Trimmed:\n{trimmed_records}')
Trimmed:
[[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]]
I'm not sure why you want to skip, but that's the easiest to understand and implement.
If your concern is memory, read on.
Discard the first M records, read only the remaining R = A - M - K records
The next option, as I see it, is to:
- take the record count (A records) from the first header
- read and ignore M records
- figure out how many remaining records to read, given that you've already read M records and want to stop K records short of the end: R = A - M - K
You'll only save a little memory by ignoring the first M records; that data is still read and interpreted. You will definitely save memory by not reading the last K records at all:
f = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f, dtype=np.int32, count=3)
Arecords = headers[0]                       # total number of records in the file

Mrecords = 1_000_000                        # records to skip at the front
Krecords = 1_000_000                        # records to skip at the end

# read, but throw away, the first M records
Nitems = Mrecords * Nitems_per_record
np.fromfile(f, dtype=np.int32, count=Nitems)

Rrecords = Arecords - Mrecords - Krecords   # remaining records to read
Nitems = Rrecords * Nitems_per_record
data = np.fromfile(f, dtype=np.int32, count=Nitems)
data = np.reshape(data, (Rrecords, Nitems_per_record))
print(f'From {Arecords} to {Rrecords} records:\n{data.shape}')
From 2079516 to 79516 records:
(79516, 10)
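A further refinement, not shown above: since every item is 4 bytes, you can skip the first M records without reading them at all by seeking past them. A minimal sketch, assuming the same Nitems_per_record and file as above and a file opened in binary mode:

f = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f, dtype=np.int32, count=3)
Arecords = int(headers[0])

Mrecords = 1_000_000
Krecords = 1_000_000
Rrecords = Arecords - Mrecords - Krecords

item_size = np.dtype(np.int32).itemsize              # 4 bytes per item
f.seek(Mrecords * Nitems_per_record * item_size, 1)  # jump over the first M records

data = np.fromfile(f, dtype=np.int32, count=Rrecords * Nitems_per_record)
data = np.reshape(data, (Rrecords, Nitems_per_record))

In recent NumPy versions, np.fromfile() also accepts an offset= keyword (in bytes) that can achieve the same effect.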
CodePudding user response:
If all you need is to chunk the big file into smaller files, so that you can operate on them independently:
import numpy as np

Nrecords_per_chunk = 100_000
Nitems_per_record = 10

f_in = open('halos_10x10.pksc', 'rb')
headers = np.fromfile(f_in, dtype=np.int32, count=3)

Nitems = Nrecords_per_chunk * Nitems_per_record
fnumber = 1
while True:
    items = np.fromfile(f_in, dtype=np.int32, count=Nitems)

    # Near the end of the file, we're very likely to get back fewer items than we asked for
    Nrecords_read = items.shape[0] // Nitems_per_record

    # At End Of File: the chunk size happened to be a perfect multiple of the record count
    if Nrecords_read == 0:
        break

    records = np.reshape(items, (Nrecords_read, Nitems_per_record))

    with open(f'halos_{fnumber}.pksc', 'wb') as f_out:
        # Keep the same format by writing 3 "header" items, each set to this chunk's record count
        new_headers = np.array([Nrecords_read] * 3, dtype=np.int32)
        new_headers.tofile(f_out)
        records.tofile(f_out)

    # At End Of File
    if Nrecords_read < Nrecords_per_chunk:
        break

    fnumber += 1

f_in.close()
# Test that the first 100_000 records from the main file match the records in the first chunk file
f_in = open('halos_10x10.pksc', 'rb')
np.fromfile(f_in, dtype=np.int32, count=3)    # skip the 3 header items
Nitems = Nrecords_per_chunk * Nitems_per_record
items = np.fromfile(f_in, dtype=np.int32, count=Nitems)
records_orig = np.reshape(items, (Nrecords_per_chunk, Nitems_per_record))
f_in.close()

f_in = open('halos_1.pksc', 'rb')
np.fromfile(f_in, dtype=np.int32, count=3)    # skip the 3 header items
Nitems = Nrecords_per_chunk * Nitems_per_record
items = np.fromfile(f_in, dtype=np.int32, count=Nitems)
records_chunked = np.reshape(items, (Nrecords_per_chunk, Nitems_per_record))
f_in.close()

assert np.array_equal(records_orig, records_chunked)
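Once the file is chunked, each piece can be read with the same pattern as in the question, e.g. with the halo data as float32. A minimal sketch, assuming the chunk files produced above exist:

import numpy as np

Nitems_per_record = 10

with open('halos_1.pksc', 'rb') as f:
    n_records = np.fromfile(f, dtype=np.int32, count=3)[0]   # this chunk's record count
    halos = np.fromfile(f, dtype=np.float32, count=n_records * Nitems_per_record)
    halos = halos.reshape(n_records, Nitems_per_record)

print(halos.shape)   # (100000, 10) for a full chunk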