The aim is to build a graph from a collection of strings (reads) in a FASTQ file. First, we implement the following function, which gets the reads. We remove the newline character from the end of each line (with str.strip()) and, by convention, convert all characters in the reads to upper case (with str.upper()). The code for that:
def get_reads(filePath):
    reads = list()  # The list of strings that will store the reads (the DNA strings) in the FASTQ file at filePath
    fastqFile = open(filePath, 'r')
    fastqLines = fastqFile.readlines()
    fastqFile.close()
    for lineIndex in range(1, len(fastqLines), 4):  # I want this explained
        line = fastqLines[lineIndex]
        reads.append(line.strip().upper())
    return reads
What is the purpose of the line for lineIndex in range(1, len(fastqLines), 4)?
We use this to make a de Bruijn graph from a collection of strings. Can someone explain, please?
CodePudding user response:
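fastqLines is a Python list containing each line read from the file. The loop for lineIndex in range(1, len(fastqLines), 4): produces lineIndex values of 1, 5, 9, ... and stops before reaching the length of the list. Each value is used to pick one line from fastqLines, which is then stripped, upper-cased, and appended to another list, reads. Because Python lists are indexed from 0, this means that the 2nd, 6th, 10th, ... lines of the file are stored in reads. The step of 4 matches the FASTQ layout: each record takes four lines (identifier, sequence, a "+" separator, quality scores), so the second line of every record is the DNA sequence itself.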
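To see the indexing in action, here is a minimal sketch that runs the same selection on a small in-memory example instead of a real file. The two records in fastq_text are made up for illustration, and the names fastq_text, fastq_lines and indices are not part of the original code.

```python
# A tiny, made-up FASTQ file: two records of four lines each
# (identifier, sequence, "+" separator, quality scores).
fastq_text = (
    "@read1\n"
    "acgt\n"
    "+\n"
    "IIII\n"
    "@read2\n"
    "ttga\n"
    "+\n"
    "IIII\n"
)

# Mimic readlines(): keep the trailing newline on every line.
fastq_lines = fastq_text.splitlines(keepends=True)

# Indices visited by the loop: start at 1, stop before len(fastq_lines), step 4.
indices = list(range(1, len(fastq_lines), 4))
print(indices)  # [1, 5]

# Same processing as get_reads: strip the newline, convert to upper case.
reads = [fastq_lines[i].strip().upper() for i in indices]
print(reads)  # ['ACGT', 'TTGA']

# The same lines can also be selected with slice notation.
assert reads == [line.strip().upper() for line in fastq_lines[1::4]]
```

Only the sequence lines survive; the identifier, separator and quality lines are skipped because their indices are never produced by range(1, len(fastq_lines), 4).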