Unwanted repeats of string elements in a list

I have to build on a previous function, which returned a dictionary with sequence IDs as keys and a (min, max) molecular weight tuple as values, to write a new function that, given a FASTA file name and a minimum and maximum molecular weight, returns a list of IDs for the sequences whose molecular weight lies within that interval. For an ambiguous sequence, the ID should be returned when its weight interval overlaps the specified interval; for an unambiguous sequence, when its fixed weight falls inside the specified interval.

The following function works perfectly and I get the right sequences, but instead of getting a list with each sequence ID once, I get a list with each ID from an ambiguous sequence repeated a lot of times… Probably because the ID occurs as many times as there are unambiguous sequences that can be formed out of that ambiguous one, since for an unambiguous sequence, the ID occurs only once.

from itertools import product

from Bio import SeqIO, SeqUtils
from Bio.Data import IUPACData

def sequence_list(filename, min_mw, max_mw):
    with open(filename) as file:
        seq_dict = {}
        seq_list = []
        for record in SeqIO.parse(file, "fasta"):
            # Expand the record into every unambiguous sequence it can represent.
            data = IUPACData.ambiguous_dna_values
            ambiguous_dna = list(map("".join, product(*map(data.get, record))))
            mol_weight = []
            for seq in ambiguous_dna:
                mol_weight.append(SeqUtils.molecular_weight(seq))
            # Store an interval for ambiguous records, a single weight otherwise.
            if min(mol_weight) != max(mol_weight):
                seq_dict[record.id] = (min(mol_weight), max(mol_weight))
            else:
                seq_dict[record.id] = min(mol_weight)
            for values in mol_weight:
                if min_mw <= values <= max_mw:
                    seq_list.append(record.id)
        print(seq_list)

The (beginning of the) result then looks something like this:

['seq_9143_unamb', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_1101', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_504', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077', 'seq_4077',...]

Could anyone perhaps help me with this?

I’ve tried some duplicate-removal methods, but they all returned that my list “wasn’t iterable”, perhaps because it’s a list of strings and not numbers.

CodePudding user response：

If all you're looking for is to remove duplicates while keeping the original order, you can use the dict.fromkeys method (here seq_list is the list your function builds):

out = list(dict.fromkeys(seq_list))

Output:

['seq_9143_unamb', 'seq_1101', 'seq_504', 'seq_4077']
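
As a minimal, self-contained sketch of why this works: dictionary keys are unique and, in Python 3.7 and later, keep their insertion order, so each ID survives once at the position of its first occurrence. The list below is a shortened, made-up stand-in for the real output, and the names ids and unique_ids are only illustrative.

```python
# Shortened stand-in for the list printed by sequence_list (illustrative only).
ids = [
    "seq_9143_unamb",
    "seq_1101", "seq_1101", "seq_1101",
    "seq_504", "seq_504",
    "seq_4077", "seq_4077",
]

# dict.fromkeys keeps one key per distinct ID, in first-seen order.
unique_ids = list(dict.fromkeys(ids))
print(unique_ids)

# A set also removes duplicates but does not guarantee the original order.
unordered_ids = set(ids)
```

Note that this cleans up the list after the fact; it does not stop the function from appending the same ID repeatedly in the first place.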

CodePudding user response：

For each record, you append the ID once for every weight that matches, so an ambiguous record with many matching weights shows up many times. The duplicates appear because the check doesn't stop after the first match. Change your last for loop to include a break like so:

for values in mol_weight:
    if min_mw <= values <= max_mw:
        seq_list.append(record.id)
        break
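
The same "append at most once per record" idea can be sketched without the explicit break, using any(). The sketch below leaves Biopython out and assumes the weights have already been computed per ID; the helper names ids_in_range and ids_overlapping and the example numbers are hypothetical, not part of the original code. The second helper shows the interval-overlap reading of the task from the question, which can give a different answer when the query range sits entirely between two of a record's possible weights.

```python
def ids_in_range(weights_by_id, min_mw, max_mw):
    """Return each ID once if any of its weights lies in [min_mw, max_mw]."""
    matches = []
    for seq_id, weights in weights_by_id.items():
        # any() stops at the first matching weight, like the break above.
        if any(min_mw <= w <= max_mw for w in weights):
            matches.append(seq_id)
    return matches


def ids_overlapping(weights_by_id, min_mw, max_mw):
    """Return each ID once if its [min, max] weight interval overlaps the query."""
    return [
        seq_id
        for seq_id, weights in weights_by_id.items()
        if min(weights) <= max_mw and max(weights) >= min_mw
    ]


# Made-up weights for illustration only, not real molecular weights.
example = {
    "seq_a": [100.0],                # unambiguous: a single weight
    "seq_b": [90.0, 120.0, 150.0],   # ambiguous: one weight inside the range
    "seq_c": [200.0, 300.0],         # ambiguous: entirely above the range
    "seq_d": [95.0, 160.0],          # ambiguous: straddles the range
}

print(ids_in_range(example, 100.0, 155.0))
print(ids_overlapping(example, 100.0, 155.0))
```

With these numbers, seq_d is reported only by the overlap check, because neither of its individual weights falls inside the range even though its interval spans it.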