Construct a pandas DataFrame from items in a nested dictionary with lists as inner values

I have a nested dictionary annot_dict with structure:

key = long unique string
value = list of dictionaries

The values, the list of dictionaries, each have structure:

key = long unique string (a subcategory of the upper dictionary's key)
value = list of five string items

An example of the entire structure is:

annot_dict['ID_string'] = [
     {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
     {'string2'  : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
     {'string3'  : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
             ]

The ID_string is the same as the first sub-dictionary key. This is the output of a GFF3 file parser function I wrote. The real dictionary holds the genes (ID_string) and transcripts (string2, string3, ...) from human chromosome 9, for anyone familiar with that file type. The attribute lists describe biotype, start index, end index, strand, and description.

I want to put this information into a pandas DataFrame now. I want to loop through the outermost keys (the ID_strings) in the dict to make one big DataFrame containing a row for each ID_string and rows for each of its subcategories underneath it (string2, string3).

I want it to look like this:

| subunit_ID |  gene_ID  | start_index | end_index | strand |biotype | desc   |
|------------|-----------|-------------|-----------|--------|--------|--------|
|'ID_string' |'ID_string'|  'attr1a'   | 'attr1b'  |'attr1c'|'attr1d'|'attr1e'|
| 'string2'  |'ID_string'|  'attr2a'   | 'attr2b'  |'attr2c'|'attr2d'|'attr2e'|
| 'string3'  |'ID_string'|  'attr3a'   | 'attr3b'  |'attr3c'|'attr3d'|'attr3e'|

I did look at other answers but none had quite the same dict structure as I do. This is my first question on SO so please feel free to improve the understandability of my question. Thanks in advance.

CodePudding user response：

You could do:

import pandas as pd

df = pd.DataFrame(
    (
        # one row per sub-dictionary entry: [subunit_ID, gene_ID, five attributes]
        [subkey, key] + value
        for key, records in annot_dict.items()
        for record in records
        for subkey, value in record.items()
    ),
    columns=[
        'subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc'
    ]
)

Result for

annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12'  : ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
        {'string13'  : ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22'  : ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
        {'string23'  : ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
    ]
}

   subunit_ID     gene_ID start_index end_index   strand  biotype     desc
0  ID_string1  ID_string1     attr11a   attr11b  attr11c  attr11d  attr11e
1    string12  ID_string1     attr12a   attr12b  attr12c  attr12d  attr12e
2    string13  ID_string1     attr13a   attr13b  attr13c  attr13d  attr13e
3  ID_string2  ID_string2     attr21a   attr21b  attr21c  attr21d  attr21e
4    string22  ID_string2     attr22a   attr22b  attr22c  attr22d  attr22e
5    string23  ID_string2     attr23a   attr23b  attr23c  attr23d  attr23e
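
The same flattening can also be written as an explicit loop, which may be easier to debug on real parser output. This is only a sketch: the function name annot_dict_to_frame and the check that every attribute list has five items are assumptions added for illustration, not part of the question or the answer above.

```python
import pandas as pd

COLUMNS = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']


def annot_dict_to_frame(annot_dict):
    """Flatten {gene_ID: [{subunit_ID: [five attrs]}, ...]} into one DataFrame.

    Hypothetical helper; the five-item check is an assumption based on the
    structure described in the question.
    """
    rows = []
    for gene_id, records in annot_dict.items():
        for record in records:
            for subunit_id, attrs in record.items():
                if len(attrs) != 5:
                    raise ValueError(
                        f'{subunit_id!r} has {len(attrs)} attributes, expected 5'
                    )
                # row layout: subunit_ID, gene_ID, then the five attributes
                rows.append([subunit_id, gene_id, *attrs])
    return pd.DataFrame(rows, columns=COLUMNS)


example = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12': ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
    ],
}
df = annot_dict_to_frame(example)
print(df)
```

Raising on a malformed record, rather than letting pandas fail on ragged rows, points directly at the offending subunit ID.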

CodePudding user response：

You could use a list comprehension to flatten the dicts into lists that include the dict keys as items, then load them into pandas:

import pandas as pd

annot_dict = {}
annot_dict['ID_string'] = [
     {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
     {'string2'  : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
     {'string3'  : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
             ]

# the gene ID is taken from the key of the first sub-dictionary
df = pd.DataFrame(
    [[k] + list(annot_dict['ID_string'][0].keys()) + v
     for i in annot_dict['ID_string'] for k, v in i.items()],
    columns=['subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc'],
)

output:

  subunit_ID    gene_ID start_index end_index  strand biotype    desc
0  ID_string  ID_string      attr1a    attr1b  attr1c  attr1d  attr1e
1    string2  ID_string      attr2a    attr2b  attr2c  attr2d  attr2e
2    string3  ID_string      attr3a    attr3b  attr3c  attr3d  attr3e
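
The snippet above handles a single outer key. To cover every gene in annot_dict, one option is to build a small frame per gene and concatenate them. The following is a rough sketch, not the answerer's method: it assumes subunit IDs are unique within a gene, and the names ATTR_COLUMNS, merged, and frames are made up for illustration.

```python
import pandas as pd

ATTR_COLUMNS = ['start_index', 'end_index', 'strand', 'biotype', 'desc']

annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12': ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
        {'string13': ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22': ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
        {'string23': ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
    ],
}

frames = []
for gene_id, records in annot_dict.items():
    # Merge this gene's single-key dicts into one {subunit_ID: attrs} mapping
    # (assumes no subunit ID repeats within a gene).
    merged = {k: v for record in records for k, v in record.items()}
    gene_df = pd.DataFrame.from_dict(merged, orient='index', columns=ATTR_COLUMNS)
    # Turn the index (the subunit IDs) into a regular first column.
    gene_df = gene_df.rename_axis('subunit_ID').reset_index()
    # Put the outer key next to it as the gene_ID column.
    gene_df.insert(1, 'gene_ID', gene_id)
    frames.append(gene_df)

df = pd.concat(frames, ignore_index=True)
print(df)
```

Using the outer key for gene_ID avoids relying on the first sub-dictionary's key matching it.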