The problem is to count the longest number of consecutive dna gene sequences in the seq.txt
file and store it in a dictionary. The dictionary should store the corresponding key value pairs for the dna. For Example, AGAT:5
. This means that AGAT occurred five consecutive times in the sample and was the maximum number of consecutive occurrences for any gene sequence. The program should then compare the values from the dna dictionary with the csv file to check to see whose dna it is.
Here is the the seq.txt file:
TCTATTCTTTGAGGATACGCTCGGCCTAGGCGGGGCTAATGGAAGCCAGGCTAATCCGATGTTGCGGTGCACCTCGATACCGTTCTAAAATATCACATCAACGCGCTCCAGTTGTGTGCCAAGGCCCGCTGAAGAGCAATGGAGCACCTACCCGGCCTTCTAACGCTGTCTAAAACTCCAAGCGAATTGCAGATTTTGGTTAGGACCCGTTTAATCTGTGGGCTTTGGTACTATGCAACCAATGGAACCGGTCGGACTCTGATCAGTCCCGACTGACAGGTCTCAAGTAGTTTGCTTACACGTTCTGACCCCCGTGCGCACCGTTGGGCGTACAGCGGTTCGGTCTATGGAATCAAGGAAAATCATTCGTATGGGGACGTAGTCACATAACAGCTGCAGGGAACTATGGAGATGACGAGGGGTCGTTTAGTGGAACGTCAAATGTCCTAACTGGTTCTGAGCTGTCTGGAACGTTGCAGTCAACGTCTACGATCTGGATTCTACAGTCTAGGCGTTCCAAGGGGCACCAGTAAGCTAAGTTGTTTAAATATGGCGGGTGTCGAAATGACGTCCAAAATCGCAAATAAGACAGATAGCAGGGGTGCAACTTAGGTATCTAAGGTAACTCTGACATACCTCATACAACTATCGAACAGTGGATTCCTTGTCGTCCTGTTGTAAACAGTTCAAGTCGGTACATGTTAGCGGGTGGTTTGGACGAGTATACAGGACCTGGCCTACACGGAATGTTTTAGATTCTATGTCCGGCGGGGACATCGCGTGCCGCTAGGATATAATTGGATTGTGGGAAGAATTTGGCCGGATTTTTGGCCTAGACTCGCGCTTCAGACCATACCGTGCGATCAGCACGATTGCTGACAAGCGTCGGTATTAAAGCAGGCTCCTTCCCAGCCAAACTAACCCAACGAAGACATCATGTTTCGCCGAAGTATCTTTGGGAGATGGGCGAATTAATCGCTTAGCGTGGCCGACTTGGGGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGGTTTAAGGGACTTATCCGACCAGAGGGGCAGTTACTTGTGGCGGTCACACGCCAGGACGAGTCTGTTCTTGCTGTGCGTAGATTAGGCTTGATCTGTGACTACAGGCGAATAGTAGGTGTGGGAAACAGAGGGGGGAGCAATGTGATCCCGGGGGGAGTGCTTCCTATACCTCGGTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGGATTATCCCCACCCAATGATCCGATCGCAAGCCTTAATACCATGGCACACCTCTCAACCTACTGATCTTCCATCCGTTTAACCCAGCACTAAGCTGCTCAGTGGTCACACTATGTTCAAGCTTCCGTGACGTGGGATCCTGGGGTCTTCGCAAGGCTAGTTTTGACCATTATCGACGACCGTCACCCTGTGACTGGTTCCCAACAAGGTGTCAAGTTCTAGCCCGTACCTGCAATCGGGAACCTCCGGTGCTTCATGAACCATGGATATAGGAATTATTGGTCTCCTCTCGCGTAGGTAGCGCGAATACCCCCAAGATGACACACTGTGGTGAACTTTGAGGACTCCCAGAAGGGTGACGGGTTATGTGGTTACGCGAAGTCGGCGTATCCACCGCCTAATTTTAAATTCAGCTCGAGCGACACGCGCGCTTCCTGGAAACGTTAGACGGGAAAAACCCCGCCCGAGAATGCGGGTTCCGCGGCCCACTAGGGGGCCCCCCAAGGATCTGACCGCGTATAAGCAATGCACAGCTGTACCATTTCAAATAGGACAGATAGTACCCCCACCGTGACTCGGCCTCAGATAATGGAATACGACCTGGTGACGGCGGTAGGGGTTCTATCTCAGGTATTCAGAGGGTGCATCCAGGTGATTCGTCACGTCCCGATTTCGACCCCACCACAGGATTTGTGCGATGGTAGTCTTGATGCTGTTTGCAGGCGGCCAAGCATCTAGGAGATGCCTCACTGCGCGAGATGAACCGGCGTTTCACAAGGGGACGCCAGGCCTTGCCGTCTCCATAAACCACGAGAAGGTATCGAACGTCAAACGGATAAATGCCGCGATACCGCTCGTTTCGAAGCGGCACTTCGATGGAAATGAGTAGTATGGCCTCGCCACACGACTACTCATCGGCTTGCGCTGACATCAATCCTGGCTGGCTTGAGGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGCTCCATAGGAAGGTGCGGGATAGCGGACAGCTAATCGGACAGAAGGGCCAGCTTGCACTCTCCTATAATTAGCAAGCGCCATACAATTGTAATCACGTATAAAATACAGCTACGTAAGTAATAGAGAGGCTCCCGGACTGTCCGGCGTCCCGCCAGTCTCGTACCAGGAGGTGGGATGGTAGGCAAACGAGCCTACTAGAATTGGGCCACCCTGTGAATAATATGCAGAGGCAACTACAGACGTCCGTCACCTGCCTAGAATCGAGTTCATTGACGGTGGGATATGCTCCGTTACCTGACTGTAGTTCGACTTTGTGGTGCGCACATAACGAGTGTCTACGATGCACAAAGTGTGAGCAAATTAGGAGTGTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTGCCGAGATGTTGGCGGGAAGTGTACGGCTTTGCGTCGTCGAGTGCTACGCAGTGTGCTACACTCCCGCAGCTGAGGCTAGGGCCCGAAACTAGACATTTTTTCTTTTGGCACTTCGTTCCGTATAATGAGTTCCCTCAATTCCCCGTCCGCAAGCCTCAGGATTACAATTAATTATACGGTTAAAGTTGGCTGCCAAGCCCGTTATTGACCGGTACCTGAGTCGAGGGGGGGTTGGGGATAGGCAATTATAGTATTCACTCACAGGACGCTCAGTAATGCCGCCGTTGTACTTCACGTAAGGGCCACAGTTTTTCTACCACAGAGGATGATCTGAGGACAGCGGTGCGTGAAGCCCGCTATTCAGGACACCCTCGAAACCCGTGGTTCACAGACAAAAAATTCGCCGCGGAAGCTGTTGCCCCTATGCCCCGGGTCAGCAAGGAGTCTGGATTTTATTCCAAGACTGCGTCTTTATTTTCTGGTGAGTATGAAATGACTCTGAGAAAATGGTCGAACCACGAGCTAGCTACAGCCACAGTCCGCTCAACTAACTTACCTCTACTCTAACAGTTACACGGCTTCCCGTTTTATGGGAAGAAGCACCTGTTCCTTTCCCAAGCCCCTTATAGCAGAGGTTGGTATTCGGTTGATTTGGAATAGTTAAACAGCGGCTATTTTGTAATCACTTTCCAGTCGGTAAGACATTCGAACCTCGTTTTGACGCTGCTCGCCATCGCGTTCGACTAGGAGTATTCCACTTTTCGGAGAGATGATTACTCATGACGCGGGGAACTCCATGGCTGTCATGCAGGATCTGGGCTAAATAAGATTAGATGTTCAACTGTCGTATACTTACTGCTACCAGCGGTGCTAGGCCCAGGACCCGCCATACCTGGCTATTGATCACTCTACCAGATGTCTCTTGACGAGTTACGAATTGCTGGGTGCTCTTGGAGACGAGTTGAGTCCGTAGTCGTGGCTGGGGAACGGGCGAGTTCGTACGTACCGTTTCAAAGCCCCACGAACCCAACCTCTTAGCCTTAACCCCACATTAGATACCCAAGTTGCATGACGCATTATGCGAGTACGACACTGGTATCGGCTGATCCGTCACTGCTCAAAGTCCAGTGGTTTCCTTATCTCGGGCTGGAAAGTGTAGCTTGTTCCAAACCTTCGAGAGGTTGATCGATGACCGGTTCTCACACACATCTTGCGGAGGGATGCTTGCGATGTGGCTTTACGTCCACCGACGGGCCGACTAGCTGGAAATCACAAACCCCTGCTCCGATAAGGTATTCTCGTTGACTTAGGGTAAACAAAATGCCCGTTACGTCCTAACCGAGTTTCCGGGCCTTCACTACCCGCGAGGGATGTGTAGTGGGGCCATTTACCTAAGCAGATGTACACCGAGTTACGATAGTCACATGGCCATTCAAAGCGTCTCACATAATCGATCGATAGATGATGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCCCGGAGGAAGCTGCGATTGGAATGCGGCTAACTTCGCTCTGCAACATTCTTGGCAGACGGCCCCAATGGCGTAATTTAGGCGTGTGTACCTAAAGTGGTCTACTCCTATGAACCGAATCGCGGGATAAATCGAGTTGGGACTGCTTTGCCTTAATTACATTCACTGATTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGGTCGAGCACGGCTGGCACGTCCGGCTCCATCGCGTCGTAATCCATCCCTATTCGACCAACAAAACCTCAGGGGACGGGATGTGAGTGGGTATCGATCATTATCGAACGCCCATAAGTACTCCACTCATCTGTCTGAAAAGTTTGTCGAGTGCCGCTCTCTGAAGAGTACGATAACTTACTCCAAACACTCTACGCCTAGTGGTCGAAAACACTAAAGGGAAAATACTCACTGACTTACTCTGTCGCTCTACGATTGCCGCGATACCTTAATAAGACACGTATCGGCTGTCGCAGCGATGGATTCCTTAAGCGATACAACTAAGATCAATCGGTGCCGGGCCTACAGCCTGGGCCCTAGCTCCAAAAGTGATAATGGATAGTCGGTTCAAGCGAATTTACACCAGACTGATCCTTTACGGTCATTCCGACCGCCGCATGATACATGCCAAAAGACACTTGTCTTCTTTCCTCTAAAAGACAGACCTTGTTTGCAAGGAGAGCCCAATCGGCACGACCCAAAGGGATTATCAACTGAACTATTATTGCATACTACTAAGCAGACGGACCGTATAGCATCATTGATACCTATTATATTTCCATACACCAACTCCATACGCGATGGGTCGAAACTACAAGCTTCACTTACGTGTACAGCCGCAGGACCCACTCTCTAATCTAGCCAATGACACTACTAATTTGAACATTCCCCAGCGATGAACAGGCACATGAGCGGTCCTCGTACCCACCACGGCCCGCTCAACTGCAAGGGGCCGCTCGGATCAAAGTTTTTCACTAACTCATGTCGAGCAGATCGGCATGCTCAAGATAGTATTTTAGGAGG
Here is copy of the csv file
name,AGATC,TTTTTTCT,AATG,TCTAG,GATA,TATC,GAAA,TCTG
Albus,15,49,38,5,14,44,14,12
Cedric,31,21,41,28,30,9,36,44
Draco,9,13,8,26,15,25,41,39
Fred,37,40,10,6,5,10,28,8
Ginny,37,47,10,23,5,48,28,23
Hagrid,25,38,45,49,39,18,42,30
Harry,46,49,48,29,15,5,28,40
Hermione,43,31,18,25,26,47,31,36
James,46,41,38,29,15,5,48,22
Kingsley,7,11,18,33,39,31,23,14
Lavender,22,33,43,12,26,18,47,41
Lily,42,47,48,18,35,46,48,50
Lucius,9,13,33,26,45,11,36,39
Luna,18,23,35,13,11,19,14,24
Minerva,17,49,18,7,6,18,17,30
Neville,14,44,28,27,19,7,25,20
Petunia,29,29,40,31,45,20,40,35
Remus,6,18,5,42,39,28,44,22
Ron,37,47,13,25,17,6,13,35
Severus,29,27,32,41,6,27,8,34
Sirius,31,11,28,26,35,19,33,6
Vernon,26,45,34,50,44,30,32,28
Zacharias,29,50,18,23,38,24,22,9
e
Here is a copy of my code
import csv
import sys
def main():
# TODO: Check for command-line usage
# TODO: Read database file into a variable
with open("dna.csv","r") as csv_file:
csv_reader = csv.reader(csv_file)
header = next(csv_reader)
# TODO: Read DNA sequence file into a variable
fil = open("seq.txt","r")
sequence = fil.read()
fil.close()
# TODO: Find longest match of each STR in DNA sequence
countsequence = {}
numberstr = 0
for r in header[1:]:
times = longest_match(sequence,r)
countsequence[r] = times
numberstr = 1
# TODO: Check database for matching profiles
checkstr = 0
with open("dna.csv" , "r") as csvfile:
csvreader = csv.DictReader(csvfile)
for rows in csvreader:
for key in countsequence:
if rows[key] != countsequence[key]:
break
elif rows[key] == countsequence[key]:
checkstr = 1
if checkstr == numberstr:
print(rows["name"])
return
print(checkstr)
def longest_match(sequence, subsequence):
"""Returns length of longest run of subsequence in sequence."""
# Initialize variables
longest_run = 0
subsequence_length = len(subsequence)
sequence_length = len(sequence)
# Check each character in sequence for most consecutive runs of subsequence
for i in range(sequence_length):
# Initialize count of consecutive runs
count = 0
# Check for a subsequence match in a "substring" (a subset of characters) within sequence
# If a match, move substring to next potential match in sequence
# Continue moving substring and checking for matches until out of consecutive matches
while True:
# Adjust substring start and end
start = i count * subsequence_length
end = start subsequence_length
# If there is a match in the substring
if sequence[start:end] == subsequence:
count = 1
# If there is no match in the substring
else:
break
# Update most consecutive matches found
longest_run = max(longest_run, count)
# After checking for runs at each character in seqeuence, return longest run found
return longest_run
main()
I was expecting the code to print out the person dna name .
CodePudding user response:
You have a subtle error in the dictionary you create with csvreader. The values are read as strings not ints. You can see this if you print rows
:
OrderedDict([('name', 'Albus'), ('AGATC', '15'), ('TTTTTTCT', '49'), ('AATG', '38'), ('TCTAG', '5'), ('GATA', '14'), ('TATC', '44'), ('GAAA', '14'), ('TCTG', '12')])
So, when you compare rows[key] != countsequence[key]
for 'AGATC', you are testing 'a_str' != 18
which will be False even when rows['AGATC'] is 18
. Either rows[key]
needs to be converted to an int or countsequence[key]
to a string. It should work (for this example) once you fix this.
Also, you have another (more subtle) error with the value of checkstr
. You initialize checkstr = 0
outside the csvreader loop then increment inside the countsequence loop. This only works when there are no gene matches for other people in the dna set. However, what happens if 1 person matches 1 gene (only). Answer: you will increment checkstr on that person, and not reset to 0 for the next person. Try modifying 'Albus' to have 'AGATC', 18 and see if it works. Good luck.