Problem in PSET6: finding STRs in a DNA sequence

Time:01-16

I am having trouble with the section "# TODO: Find longest match of each STR in DNA sequence".

I don't understand why, when I print(longest_str), every value is 0: {'AGATC': 0, 'AATG': 0, 'TATC': 0}

Am I calling the longest_match function wrong?

P.S. I am new to programming and Python. Thanks for your help!

import csv
import sys   

def main():
    # TODO: Check for command-line usage
    longest_str = {}
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py, data.csv, sequence.txt")

    # TODO: Read database file into a variable
    with open(sys.argv[1]) as f:
        data = csv.DictReader(f)

    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2]) as f2:
        dna_sequence = csv.DictReader(f2)

    # TODO: Find longest match of each STR in DNA sequence
    subsequences = data.fieldnames[1:]
    for subsequence in subsequences:
        longest_str[subsequence] = longest_match(str(dna_sequence), subsequence)
    print(longest_str)

    # TODO: Check database for matching profiles

    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run


main()

CodePudding user response:

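The DNA sequence file is not a CSV file, so this line is the problem: dna_sequence = csv.DictReader(f2)

dna_sequence is a csv.DictReader object here. Calling str(dna_sequence) only gives the object's description (something like <csv.DictReader object at 0x...>), not the contents of the file, so no STR is ever found in it and every count is 0. The longest_match function provided by CS50 needs the sequence itself as a string.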
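
A minimal sketch of what longest_match actually receives from str(dna_sequence). It uses io.StringIO with made-up contents as a stand-in for the open sequence file; the variable names are only for illustration.

```python
import csv
import io

# Hypothetical stand-in for an open sequence file (made-up contents).
fake_file = io.StringIO("AGATCAGATCAATG\n")

reader = csv.DictReader(fake_file)

# str() on the reader gives the object's default description,
# not the text stored in the file.
as_text = str(reader)

print(as_text)                   # something like <csv.DictReader object at 0x...>
print(as_text.count("AGATC"))    # 0, because the DNA text is not in that string
```

Any search for an STR inside that description string comes back empty, which matches the all-zero dictionary in the question.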

CodePudding user response:

To clarify what @Fuelled_By_Coffee said: csv.DictReader() returns a DictReader object. You iterate over it to get one dictionary per row of the CSV file. So data and dna_sequence are DictReader objects, NOT the contents of each file.

A DictReader object is appropriate for reading the CSV file. However, you're not done reading that file: creating the reader does not read any rows by itself. Before you start checking DNA sequences, you need to read all of the data from the CSV file into memory, and you need to do it while the file is still open (inside the with block). My advice: get this working first, BEFORE you work on the rest of the code.
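
One possible way to do that, as a sketch rather than the required solution. The helper name load_database is hypothetical, and it assumes the CSV has a header row whose first column is the person's name, as in databases\small.csv.

```python
import csv


def load_database(path):
    """Hypothetical helper: read every row of the CSV into memory."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # list() pulls every row out of the reader while the file is open.
        rows = list(reader)
        # Header names after the first column are assumed to be the STRs.
        strs = reader.fieldnames[1:]
    return rows, strs
```

After this, rows is an ordinary list of dictionaries (all values are strings, so counts need int() before comparing), and strs can be used in the loop that calls longest_match.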

Regarding the dna_sequence data, these files aren't appropriate for DictReader, which expects a header row with field names. To see what I mean, compare the contents of sequence\1.txt to databases\small.csv. Notice how the CSV has a header line and the sequence file doesn't? You need a different Python method to read the sequence files.
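
For example, a plain read() on the file object returns the whole file as one string. This is only a sketch: the helper name load_sequence is hypothetical, and it assumes the sequence file holds nothing but the DNA text, possibly followed by a newline.

```python
def load_sequence(path):
    """Hypothetical helper: return the DNA sequence file as one string."""
    with open(path) as f:
        # read() returns the full text; strip() drops a trailing newline.
        return f.read().strip()
```

The returned string can be passed straight to longest_match(sequence, subsequence) with no str() conversion.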
