Construct a pandas DataFrame from items in a nested dictionary with lists as inner values

Time:11-21

I have a nested dictionary annot_dict with structure:

  • key = long unique string
  • value = list of dictionaries

The values, the list of dictionaries, each have structure:

  • key = long unique string (a subcategory of the upper dictionary's key)
  • value = list of five string items

An example of the entire structure is:

annot_dict['ID_string'] = [
     {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
     {'string2'  : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
     {'string3'  : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
             ]

The ID_string is the same as the first sub-dictionary key. This is the output of a GFF3 file parser function I wrote; the real dictionary holds the genes (ID_string) and transcripts (string2, string3, ...) from human chromosome 9, if anyone is familiar with that file type. The attribute lists describe biotype, start index, end index, strand, and description.

I want to put this information into a pandas DataFrame now. I want to loop through the outermost keys (the ID_strings) in the dict to make one big DataFrame containing a row for each ID_string and rows for each of its subcategories underneath it (string2, string3).

I want it to look like this:

| subunit_ID |  gene_ID  | start_index | end_index | strand |biotype | desc   |
|------------|-----------|-------------|-----------|--------|--------|--------|
|'ID_string' |'ID_string'|  'attr1a'   | 'attr1b'  |'attr1c'|'attr1d'|'attr1e'|
| 'string2'  |'ID_string'|  'attr2a'   | 'attr2b'  |'attr2c'|'attr2d'|'attr2e'|
| 'string3'  |'ID_string'|  'attr3a'   | 'attr3b'  |'attr3c'|'attr3d'|'attr3e'|

I did look at other answers but none had quite the same dict structure as I do. This is my first question on SO so please feel free to improve the understandability of my question. Thanks in advance.

CodePudding user response:

You could do:

import pandas as pd

# One row per sub-dictionary: [subunit_ID, gene_ID] followed by the five attributes
df = pd.DataFrame(
    (
        [subkey, key] + value
        for key, records in annot_dict.items()
        for record in records
        for subkey, value in record.items()
    ),
    columns=[
        'subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc'
    ]
)

Result for

annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12'  : ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
        {'string13'  : ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22'  : ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
        {'string23'  : ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
    ]
}

is

   subunit_ID     gene_ID start_index end_index   strand  biotype     desc
0  ID_string1  ID_string1     attr11a   attr11b  attr11c  attr11d  attr11e
1    string12  ID_string1     attr12a   attr12b  attr12c  attr12d  attr12e
2    string13  ID_string1     attr13a   attr13b  attr13c  attr13d  attr13e
3  ID_string2  ID_string2     attr21a   attr21b  attr21c  attr21d  attr21e
4    string22  ID_string2     attr22a   attr22b  attr22c  attr22d  attr22e
5    string23  ID_string2     attr23a   attr23b  attr23c  attr23d  attr23e
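
If some attribute lists might not have exactly five items, pandas will raise an error while building the frame without saying which record caused it. The sketch below wraps the same flattening in a function with an explicit length check, and then indexes by gene so each gene's rows can be pulled out together. The names annot_dict_to_frame and COLUMNS are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Column order taken from the question's desired table
COLUMNS = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']


def annot_dict_to_frame(annot_dict):
    """Flatten {gene_ID: [{subunit_ID: [five attrs]}, ...]} into one DataFrame."""
    rows = []
    for gene_id, records in annot_dict.items():
        for record in records:
            for subunit_id, attrs in record.items():
                # Fail early, naming the offending record
                if len(attrs) != 5:
                    raise ValueError(
                        f"{subunit_id!r}: expected 5 attributes, got {len(attrs)}"
                    )
                rows.append([subunit_id, gene_id, *attrs])
    return pd.DataFrame(rows, columns=COLUMNS)


annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12': ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
        {'string13': ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22': ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
        {'string23': ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
    ],
}

df = annot_dict_to_frame(annot_dict)

# Optional: index by (gene_ID, subunit_ID) so a gene's rows can be selected at once
grouped = df.set_index(['gene_ID', 'subunit_ID'])
gene1_rows = grouped.loc['ID_string1']
```

Here gene1_rows holds the three rows belonging to 'ID_string1', indexed by subunit_ID.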

CodePudding user response:

You could use a list comprehension to flatten each dict into a list that includes the dict keys as items, then load the result into pandas:

import pandas as pd

annot_dict = {}
annot_dict['ID_string'] = [
     {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
     {'string2'  : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
     {'string3'  : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
             ]

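# Each row is [subunit_ID, gene_ID] + attributes; gene_ID is the first sub-dictionary's key
df = pd.DataFrame(
    [
        [k] + list(annot_dict['ID_string'][0].keys()) + v
        for i in annot_dict['ID_string']
        for k, v in i.items()
    ],
    columns=['subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc']
)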

output:

  subunit_ID    gene_ID start_index end_index  strand biotype    desc
0  ID_string  ID_string      attr1a    attr1b  attr1c  attr1d  attr1e
1    string2  ID_string      attr2a    attr2b  attr2c  attr2d  attr2e
2    string3  ID_string      attr3a    attr3b  attr3c  attr3d  attr3e
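
The comprehension above only reads annot_dict['ID_string'], so it covers a single gene. A minimal sketch of extending the same idea to every outer key follows. It assumes, as the question states, that the first sub-dictionary's key is the gene ID; the two-gene sample data is made up for illustration.

```python
import pandas as pd

COLUMNS = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']

# Hypothetical sample data with two genes
annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12': ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22': ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
    ],
}

rows = []
for records in annot_dict.values():
    # Assumption from the question: the first sub-dictionary key is the gene ID
    gene_id = next(iter(records[0]))
    for record in records:
        for subunit_id, attrs in record.items():
            rows.append([subunit_id, gene_id] + attrs)

df = pd.DataFrame(rows, columns=COLUMNS)
```

Since the outer key and the first sub-dictionary key are the same string, iterating over annot_dict.items() and using the outer key directly would give the same gene_ID column.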