Create a list inside a nested dictionary

My nested dictionary has a long list of top-level (main) keys. For one of the sub-keys, I want to turn its value into a list. Below is one record from the dictionary; all records are structured the same way.

    {'C4QY10_e':
        {'protein accession': 'C4QY10_e',
         'sequence length': [1879],
         'analysis': 'Pfam',
         'signature accession': 'PF18314, PF02801, PF18325, PF00109, PF01648',
         'signature description': "Fatty acid synthase type I helical domain, Beta-ketoacyl synthase, Fatty acid synthase subunit alpha Acyl carrier domain, 4'-phosphopantetheinyl transferase superfamily",
         'start location': [328, 139, 1761],
         'stop location': [528, 300, 1861],
         'e-value': [4.7e-73, 1.3e-72, 1.4e-18],
         'interpro accession': 'IPR041550, IPR040899, IPR008278',
         'interpro description': "Fatty acid synthase type I, Fatty acid synthase subunit alpha, 4'-phosphopantetheinyl transferase domain",
         'nunique': [1]
        }
    }

The sub-key whose value I want to turn into a list is 'interpro description'. The string should be split on ',', so that element [0] of the list is "Fatty acid synthase type I" and element [1] is "Fatty acid synthase subunit alpha". It is very important that the values keep their input order.

CodePudding user response：

Use str.split(), which returns the pieces in their original order:

yourdict = {'C4QY10_e': 
{'protein accession': 'C4QY10_e', 
'sequence length': [1879], 
'analysis': 'Pfam', 
'signature accession': 'PF18314, PF02801, PF18325, PF00109, PF01648', 
'signature description': "Fatty acid synthase type I helical domain, Beta-ketoacyl synthase, Fatty acid synthase subunit alpha Acyl carrier domain, 4'-phosphopantetheinyl transferase superfamily", 
'start location': [328, 139, 1761], 
'stop location': [528, 300, 1861], 
'e-value': [4.7e-73, 1.3e-72, 1.4e-18], 
'interpro accession': 'IPR041550, IPR040899, IPR008278', 
'interpro description': "Fatty acid synthase type I, Fatty acid synthase subunit alpha, 4'-phosphopantetheinyl transferase domain", 
'nunique': [1]
}}

# Split the comma-separated string into a list; split() keeps the pieces in input order.
yourdict['C4QY10_e']['interpro description'] = yourdict['C4QY10_e']['interpro description'].split(', ')
print(yourdict)

Output:

{'C4QY10_e': {'protein accession': 'C4QY10_e',
  'sequence length': [1879],
  'analysis': 'Pfam',
  'signature accession': 'PF18314, PF02801, PF18325, PF00109, PF01648',
  'signature description': "Fatty acid synthase type I helical domain, Beta-ketoacyl synthase, Fatty acid synthase subunit alpha Acyl carrier domain, 4'-phosphopantetheinyl transferase superfamily",
  'start location': [328, 139, 1761],
  'stop location': [528, 300, 1861],
  'e-value': [4.7e-73, 1.3e-72, 1.4e-18],
  'interpro accession': 'IPR041550, IPR040899, IPR008278',
  'interpro description': ['Fatty acid synthase type I',
   'Fatty acid synthase subunit alpha',
   "4'-phosphopantetheinyl transferase domain"],
  'nunique': [1]}}
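
Since the question mentions many main keys, the same split can be applied to every record in a loop. This is only a minimal sketch, assuming each record is a dict and that some records might lack the sub-key. The `records` name and the second record (including its accession) are made up for illustration and show only the relevant sub-key:

```python
# Hypothetical sample data: only the sub-key of interest is included.
records = {
    'C4QY10_e': {
        'interpro description': "Fatty acid synthase type I, Fatty acid synthase subunit alpha, 4'-phosphopantetheinyl transferase domain",
    },
    'X00000_e': {  # made-up accession, for illustration only
        'interpro description': "Beta-ketoacyl synthase",
    },
}

for record in records.values():
    description = record.get('interpro description')
    # Skip records where the sub-key is missing or has already been converted.
    if isinstance(description, str):
        # split() yields the pieces in input order; strip() trims stray spaces.
        record['interpro description'] = [part.strip() for part in description.split(',')]

print(records['C4QY10_e']['interpro description'])
```

Splitting on ',' and stripping each piece tolerates inconsistent spacing after the commas, whereas split(', ') expects exactly one space.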

CodePudding user response：

Here's another solution, a recursive function that also handles dictionaries with several levels of nesting:


def to_list(_dict: dict, _key: str = 'interpro description') -> dict:    
    """Convert a string to a list of strings.

    Parameters
    ----------
    _dict : dict
        A dictionary to convert `_key` into a list.
    _key : str
        A key in the dictionary.

    Returns
    -------
    dict
        The original dictionary, with `_key` modified into a list of strings.

    Notes
    -----
    Function accepts dictionaries with multiple levels.
    """
    for key, value in _dict.items():
        if isinstance(value, dict):
            # Recurse into nested dictionaries; they are modified in place.
            to_list(value, _key)
        elif key == _key and isinstance(value, str):
            # Split on commas and trim surrounding whitespace from each piece.
            _dict[key] = [item.strip() for item in value.split(',')]
    return _dict


# == Example ==========
my_dict = {
    'C4QY10_e': {
        'protein accession': 'C4QY10_e',
        'sequence length': [1879],
        'analysis': 'Pfam',
        'signature accession': 'PF18314, PF02801, PF18325, PF00109, PF01648', 
        'signature description': "Fatty acid synthase type I helical domain, Beta-ketoacyl synthase, Fatty acid synthase subunit alpha Acyl carrier domain, 4'-phosphopantetheinyl transferase superfamily",
        'start location': [328, 139, 1761],
        'stop location': [528, 300, 1861],
        'e-value': [4.7e-73, 1.3e-72, 1.4e-18],
        'interpro accession': 'IPR041550, IPR040899, IPR008278', 
        'interpro description': "Fatty acid synthase type I, Fatty acid synthase subunit alpha, 4'-phosphopantetheinyl transferase domain",
        'nunique': [1],
    }
}

# to_list modifies my_dict in place and returns the same object.
_my_dict = to_list(my_dict)
print(_my_dict['C4QY10_e']['interpro description'])
# Prints:
# ['Fatty acid synthase type I', 'Fatty acid synthase subunit alpha', "4'-phosphopantetheinyl transferase domain"]
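
If the original dictionary has to stay unchanged, a variant that builds a new dictionary is one option. The following is only a sketch: `split_key_copy` and its `sep` parameter are hypothetical names that do not appear in the answers above, and values other than nested dicts (such as the lists) are shared with the original rather than copied.

```python
def split_key_copy(data, key='interpro description', sep=','):
    """Return a new nested dict in which string values under `key` are split into lists."""
    result = {}
    for k, v in data.items():
        if isinstance(v, dict):
            # Rebuild nested dictionaries recursively.
            result[k] = split_key_copy(v, key, sep)
        elif k == key and isinstance(v, str):
            # Split in input order and trim whitespace around each piece.
            result[k] = [part.strip() for part in v.split(sep)]
        else:
            # Other values are reused as-is (not deep-copied).
            result[k] = v
    return result


original = {
    'C4QY10_e': {
        'interpro description': "Fatty acid synthase type I, Fatty acid synthase subunit alpha",
        'nunique': [1],
    }
}
converted = split_key_copy(original)
print(converted['C4QY10_e']['interpro description'])
# The original still holds the unsplit string.
print(type(original['C4QY10_e']['interpro description']).__name__)
```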