I need to make a Python script that creates a FASTA file containing all of the records from all of the .fa files in /resources/pvalb/. The script should take a list of files and output the contents of all of them, essentially mimicking the cat command from bash.
Here's what I have so far as an example:
> import sys
>
> filenames = sys.argv[1:]
>
> for filename in filenames:
>     for line in open(filename):
>         line = line.rstrip("\n")
>         print(line)
CodePudding user response:
Here are some things to consider for your script:
- The files might contain binary data or an unrecognized text encoding
- The files may be quite large
- They may even be larger than the memory available on your system
- It's even possible that one "line" would be larger than the memory on your system
To solve these problems, it's best to open the files in binary mode with 'b', and then read them in chunks.
Here's an example that does this in chunks of up to 4 KiB each:
import sys

for path in sys.argv[1:]:
    with open(path, 'rb') as file:
        while data := file.read(4096):
            sys.stdout.buffer.write(data)
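With that saved as a script (for example cat_fasta.py, the name is just for illustration), you could build the combined FASTA file with something like `python cat_fasta.py /resources/pvalb/*.fa > combined.fa`, letting the shell expand the glob into the list of files.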
Some more tips:
- Most file systems, most SSDs, and newer HDDs store data in 4 KiB blocks, so reading and writing in 4 KiB chunks is usually very efficient
- For very large files, especially in "slow" languages like Python, you can get faster performance by increasing these chunks to 1 MiB or even 10 MiB
- 1 MiB and 10 MiB chunks are common on big, distributed file systems like Lustre and BeeGFS, though these file systems can have a wide variety of block sizes.
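If you'd rather not hand-roll the read/write loop, here is a minimal sketch using the standard library's shutil.copyfileobj, which performs the same chunked copy for you; the 1 MiB CHUNK_SIZE value below is only an illustrative choice, not a requirement:

import shutil
import sys

# Example value only: a 1 MiB buffer; tune it (4 KiB, 1 MiB, 10 MiB) for your files and file system.
CHUNK_SIZE = 1024 * 1024

for path in sys.argv[1:]:
    with open(path, 'rb') as src:
        # copyfileobj reads from src and writes to stdout in CHUNK_SIZE pieces,
        # so memory use stays bounded no matter how large a file or "line" is.
        shutil.copyfileobj(src, sys.stdout.buffer, CHUNK_SIZE)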