deleting repeating value in nested dictionary

Time:10-16

I know my question is pretty basic, but the answers I found online didn't work for me. I have a rather long nested dictionary in which some values are repeated. Here is a sample slice of it:

 {'C4QY10_e': {'protein accession': ['C4QY10_e',
   'C4QY10_e',
   'C4QY10_e',
   'C4QY10_e',
   'C4QY10_e'],
  'sequence length': ['1879', '1879', '1879', '1879', '1879'],
  'analysis': ['Pfam', 'Pfam', 'Pfam', 'Pfam', 'Pfam'],
  'signature accession': ['PF18314',
   'PF02801',
   'PF18325',
   'PF00109',
   'PF01648'],
  'signature description': ['Fatty acid synthase type I helical domain',
   'Beta-ketoacyl synthase',
   'Fatty acid synthase subunit alpha Acyl carrier domain',
   'Beta-ketoacyl synthase',
   "4'-phosphopantetheinyl transferase superfamily"],
  'start location': ['328', None, '139', None, '1761'],
  'stop location': ['528', None, '300', None, '1861'],
  'e-value': ['4.7E-73', None, '1.3E-72', None, '1.4E-18'],
  'interpro accession': ['IPR041550', None, 'IPR040899', None, 'IPR008278'],
  'interpro description': ['Fatty acid synthase type I',
   None,
   'Fatty acid synthase subunit alpha',
   None,
   "4'-phosphopantetheinyl transferase domain"],
  'nunique': [1, 1, 1, 1, 1],
  'domain_count': [5, 5, 5, 5, 5]},

As you can see, values are repeated, and some of the lists also contain None values. How can I fix this?

CodePudding user response:

dictionary = {
    'C4QY10_e':
        {'protein accession':
         ['C4QY10_e',
          'C4QY10_e',
          'C4QY10_e',
          'C4QY10_e',
          'C4QY10_e'],
         'sequence length': ['1879', '1879', '1879', '1879', '1879'],
         'analysis': ['Pfam', 'Pfam', 'Pfam', 'Pfam', 'Pfam'],
         'signature accession': ['PF18314',
                                 'PF02801',
                                 'PF18325',
                                 'PF00109',
                                 'PF01648'],
         'signature description': ['Fatty acid synthase type I helical domain',
                                   'Beta-ketoacyl synthase',
                                   'Fatty acid synthase subunit alpha Acyl carrier domain',
                                   'Beta-ketoacyl synthase',
                                   "4'-phosphopantetheinyl transferase superfamily"],
         'start location': ['328', None, '139', None, '1761'],
         'stop location': ['528', None, '300', None, '1861'],
         'e-value': ['4.7E-73', None, '1.3E-72', None, '1.4E-18'],
         'interpro accession': ['IPR041550', None, 'IPR040899', None, 'IPR008278'],
         'interpro description': ['Fatty acid synthase type I',
                                  None,
                                  'Fatty acid synthase subunit alpha',
                                  None,
                                  "4'-phosphopantetheinyl transferase domain"],
         'nunique': [1, 1, 1, 1, 1],
         'domain_count': [5, 5, 5, 5, 5]},
}


def remove_repeating_value(dictionary):
    for value in dictionary.values():
        for key1, value1 in value.items():
            # Drop the None entries first.
            cleaned = [item for item in value1 if item is not None]
            # Collapse a list whose entries are all identical to one element.
            if cleaned and all(item == cleaned[0] for item in cleaned):
                cleaned = cleaned[:1]
            value[key1] = cleaned
    return dictionary


# calling function
print(remove_repeating_value(dictionary))

Output (reformatted for readability):

{
    'C4QY10_e':
        {'protein accession': ['C4QY10_e'],
            'sequence length': ['1879'],
            'analysis': ['Pfam'],
            'signature accession': ['PF18314', 'PF02801', 'PF18325', 'PF00109', 'PF01648'],
            'signature description': ['Fatty acid synthase type I helical domain',
                                        'Beta-ketoacyl synthase',
                                        'Fatty acid synthase subunit alpha Acyl carrier domain',
                                        'Beta-ketoacyl synthase',
                                        "4'-phosphopantetheinyl transferase superfamily"],
            'start location': ['328', '139', '1761'],
            'stop location': ['528', '300', '1861'],
            'e-value': ['4.7E-73', '1.3E-72', '1.4E-18'],
            'interpro accession': ['IPR041550', 'IPR040899', 'IPR008278'],
            'interpro description': ['Fatty acid synthase type I',
                                        'Fatty acid synthase subunit alpha', 
                                        "4'-phosphopantetheinyl transferase domain"],
            'nunique': [1],
            'domain_count': [5]}
}

CodePudding user response:

You can remove duplicates and None values from the dictionary like this (here d is your dictionary; note that a set does not keep the original order of the list):

for k,v in d['C4QY10_e'].items():
    d['C4QY10_e'][k] = list(set(filter(None, v)))

Output

{'C4QY10_e': {'protein accession': ['C4QY10_e'],
  'sequence length': ['1879'],
  'analysis': ['Pfam'],
  'signature accession': ['PF18325',
   'PF18314',
   'PF00109',
   'PF01648',
   'PF02801'],
  'signature description': ['Beta-ketoacyl synthase',
   'Fatty acid synthase type I helical domain',
   'Fatty acid synthase subunit alpha Acyl carrier domain',
   "4'-phosphopantetheinyl transferase superfamily"],
  'start location': ['1761', '139', '328'],
  'stop location': ['300', '528', '1861'],
  'e-value': ['1.3E-72', '4.7E-73', '1.4E-18'],
  'interpro accession': ['IPR008278', 'IPR041550', 'IPR040899'],
  'interpro description': ['Fatty acid synthase type I',
   "4'-phosphopantetheinyl transferase domain",
   'Fatty acid synthase subunit alpha'],
  'nunique': [1],
  'domain_count': [5]}}

No key is repeated within the same level; the same key can only appear again at a different nesting level. Pretty-printing the dictionary makes the levels easier to see:

{'C4QY10_e': {'analysis': ['Pfam', 'Pfam', 'Pfam', 'Pfam', 'Pfam'],
              'domain_count': [5, 5, 5, 5, 5],
              'e-value': ['4.7E-73', None, '1.3E-72', None, '1.4E-18'],
              'interpro accession': ['IPR041550',
                                     None,
                                     'IPR040899',
                                     None,
                                     'IPR008278'],
              'interpro description': ['Fatty acid synthase type I',
                                       None,
                                       'Fatty acid synthase subunit alpha',
                                       None,
                                       "4'-phosphopantetheinyl transferase "
                                       'domain'],
              'nunique': [1, 1, 1, 1, 1],
              'protein accession': ['C4QY10_e',
                                    'C4QY10_e',
                                    'C4QY10_e',
                                    'C4QY10_e',
                                    'C4QY10_e'],
              'sequence length': ['1879', '1879', '1879', '1879', '1879'],
              'signature accession': ['PF18314',
                                      'PF02801',
                                      'PF18325',
                                      'PF00109',
                                      'PF01648'],
              'signature description': ['Fatty acid synthase type I helical '
                                        'domain',
                                        'Beta-ketoacyl synthase',
                                        'Fatty acid synthase subunit alpha '
                                        'Acyl carrier domain',
                                        'Beta-ketoacyl synthase',
                                        "4'-phosphopantetheinyl transferase "
                                        'superfamily'],
              'start location': ['328', None, '139', None, '1761'],
              'stop location': ['528', None, '300', None, '1861']}}

You can check the nesting levels in the output above.

To remove the None values from a list, you can use filter:

In [1]: list(filter(None, ['328', None, '139', None, '1761']))
Out[1]: ['328', '139', '1761']
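
One caveat about the filter call above: filter(None, ...) drops every falsy item, not only None. That makes no difference for the sample data, but it would matter if a list could contain 0 or an empty string. A minimal sketch of the difference, using a made-up list:

```python
values = [0, None, '', '328', None, 5]

# filter(None, ...) keeps only truthy items, so 0 and '' are dropped too.
truthy_only = list(filter(None, values))

# An explicit identity check removes None and nothing else.
without_none = [item for item in values if item is not None]

print(truthy_only)   # ['328', 5]
print(without_none)  # [0, '', '328', 5]
```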

CodePudding user response:

Since some lists contain duplicates, you can use this code to build a list without duplicates that keeps the original order:

ordered_list = list(dict.fromkeys(duplicated_list))
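
As a rough sketch, the same idea can also skip the None entries in one pass. The helper name dedupe_keep_order and the sample list are made up for illustration, and this relies on dicts preserving insertion order (Python 3.7+):

```python
def dedupe_keep_order(items):
    """Return items without None and without duplicates, in first-seen order."""
    # dict.fromkeys keeps only the first occurrence of each key.
    return list(dict.fromkeys(item for item in items if item is not None))


duplicated_list = ['Pfam', None, 'PF02801', 'Pfam', None, 'PF00109']
ordered_list = dedupe_keep_order(duplicated_list)
print(ordered_list)  # ['Pfam', 'PF02801', 'PF00109']
```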

CodePudding user response:

I think you want to remove duplicates and None values from the lists. For this, try the following function:

def remove_duplication_from_arr_of_nested_dict(input_dict):
    for key, value in input_dict.items():
        if isinstance(value, dict):
            # Recurse into nested dictionaries.
            input_dict[key] = remove_duplication_from_arr_of_nested_dict(value)
        elif isinstance(value, list):
            # A set drops duplicates (the original order is not kept).
            value = set(value)
            value.discard(None)  # remove None if it is present
            input_dict[key] = list(value)
    # Return only after every key has been processed.
    return input_dict

Just pass your dictionary to the function; it will remove the duplicates and None values from the lists. If you want to remove keys that are duplicated at different levels, use the same recursive approach: collect the keys you have already seen, check each new key against that collection, and delete it if it is already there.
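
Here is one possible reading of that last suggestion, as a sketch only. The function name drop_keys_seen_before and the sample dictionary are hypothetical, and it assumes that the first occurrence met in a depth-first walk is the one to keep:

```python
def drop_keys_seen_before(nested, seen=None):
    """Delete any key that already appeared earlier in a depth-first walk."""
    if seen is None:
        seen = set()
    for key in list(nested):  # copy the keys so deleting while looping is safe
        if key in seen:
            del nested[key]
            continue
        seen.add(key)
        if isinstance(nested[key], dict):
            drop_keys_seen_before(nested[key], seen)
    return nested


sample = {'a': {'b': 1, 'a': 2}, 'b': 3, 'c': {'d': {'c': 4}}}
result = drop_keys_seen_before(sample)
print(result)  # {'a': {'b': 1}, 'c': {'d': {}}}
```

Note that this modifies the dictionary in place, and which copy of a key survives depends on the traversal order.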

CodePudding user response:

You can convert a list to a set then back to a list to remove duplicates. However, you need to bear in mind that the original order of the list may not be retained.

Removal of None values can be dealt with in the same list comprehension that's used to reconstruct the list.

def fix(d):
    for v in d.values():
        if isinstance(v, dict):
            fix(v)
        elif isinstance(v, list):
            v[:] = [e for e in set(v) if e is not None]
    return d

print(fix(data))

If keeping the original order of the lists is important, build a dict instead of a set, since dicts preserve insertion order:

def fix(d):
    for v in d.values():
        if isinstance(v, dict):
            fix(v)
        elif isinstance(v, list):
            v[:] = list({e: None for e in v if e is not None})
    return d

print(fix(data))
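
Both versions of fix change the lists in place. If the original dictionary should stay untouched, a variant that builds a new dictionary might look like the sketch below. The name fixed_copy is made up, and data here is just a shortened slice of the dictionary from the question:

```python
def fixed_copy(d):
    """Return a cleaned copy of d; the input dictionary is left unchanged."""
    result = {}
    for k, v in d.items():
        if isinstance(v, dict):
            result[k] = fixed_copy(v)
        elif isinstance(v, list):
            # Drop None, then drop duplicates while keeping first-seen order.
            result[k] = list(dict.fromkeys(e for e in v if e is not None))
        else:
            result[k] = v
    return result


data = {'C4QY10_e': {'analysis': ['Pfam', 'Pfam', 'Pfam'],
                     'start location': ['328', None, '139', None, '1761'],
                     'domain_count': [5, 5, 5]}}
cleaned = fixed_copy(data)
print(cleaned)
# {'C4QY10_e': {'analysis': ['Pfam'], 'start location': ['328', '139', '1761'], 'domain_count': [5]}}
```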