I have a nested dictionary annot_dict with structure:
- key = long unique string
- value = list of dictionaries
Each dictionary in that list has the structure:
- key = long unique string (a subcategory of the upper dictionary's key)
- value = list of five string items
An example of the entire structure is:
annot_dict['ID_string'] = [
{'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
{'string2' : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
{'string3' : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
]
The ID_string is the same as the first sub-dictionary key. This is the output of a gff3 file parser function I wrote and the real dictionary information is the genes (ID_string) and transcripts (string2, string3, ...) from the genome of human chromosome 9, if anyone is familiar with the structure of that file type. The attribute lists describe biotype, start index, end index, strand, and description.
I want to put this information into a pandas DataFrame now. I want to loop through the outermost keys (the ID_strings) in the dict to make one big DataFrame containing a row for each ID_string and rows for each of its subcategories underneath it (string2, string3).
I want it to look like this:
| subunit_ID | gene_ID | start_index | end_index | strand |biotype | desc |
|------------|-----------|-------------|-----------|--------|--------|--------|
|'ID_string' |'ID_string'| 'attr1a' | 'attr1b' |'attr1c'|'attr1d'|'attr1e'|
| 'string2' |'ID_string'| 'attr2a' | 'attr2b' |'attr2c'|'attr2d'|'attr2e'|
| 'string3' |'ID_string'| 'attr3a' | 'attr3b' |'attr3c'|'attr3d'|'attr3e'|
I did look at other answers but none had quite the same dict structure as I do. This is my first question on SO so please feel free to improve the understandability of my question. Thanks in advance.
CodePudding user response:
You could do:
import pandas as pd

df = pd.DataFrame(
    (
        # One row per sub-dictionary entry: subunit key, gene key, then the five attributes
        [subkey, key] + value
        for key, records in annot_dict.items()
        for record in records
        for subkey, value in record.items()
    ),
    columns=[
        'subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc'
    ]
)
Result for
annot_dict = {
'ID_string1': [
{'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
{'string12' : ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
{'string13' : ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
],
'ID_string2': [
{'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
{'string22' : ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
{'string23' : ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
]
}
is
subunit_ID gene_ID start_index end_index strand biotype desc
0 ID_string1 ID_string1 attr11a attr11b attr11c attr11d attr11e
1 string12 ID_string1 attr12a attr12b attr12c attr12d attr12e
2 string13 ID_string1 attr13a attr13b attr13c attr13d attr13e
3 ID_string2 ID_string2 attr21a attr21b attr21c attr21d attr21e
4 string22 ID_string2 attr22a attr22b attr22c attr22d attr22e
5 string23 ID_string2 attr23a attr23b attr23c attr23d attr23e
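If the nested generator expression is hard to read, the same flattening can be written with explicit loops. This is only a sketch of the idea above: the helper name annot_dict_to_rows and the COLUMNS constant are made up for illustration, and the sample data is a shortened version of the example dictionary.

```python
import pandas as pd

# Column order taken from the desired table in the question
COLUMNS = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']


def annot_dict_to_rows(annot_dict):
    """Flatten {gene_ID: [{subunit_ID: [five attrs]}, ...]} into a list of rows.

    Hypothetical helper, not part of pandas.
    """
    rows = []
    for gene_id, records in annot_dict.items():        # outer keys (genes)
        for record in records:                         # each one-entry sub-dictionary
            for subunit_id, attrs in record.items():   # its single key/value pair
                rows.append([subunit_id, gene_id] + list(attrs))
    return rows


# Shortened sample data in the same shape as the example above
annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12': ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
    ],
}

df = pd.DataFrame(annot_dict_to_rows(annot_dict), columns=COLUMNS)
print(df)
```

Each appended row has seven items (two IDs plus five attributes), which is why it lines up with the seven column names.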
CodePudding user response:
You could use a list comprehension to flatten the dicts into lists that include the dict keys as items, then load the result into pandas:
import pandas as pd
annot_dict = {}
annot_dict['ID_string'] = [
{'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
{'string2' : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
{'string3' : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
]
df = pd.DataFrame(
    [
        # subunit key + gene key (the first sub-dictionary's key) + the five attributes
        [k] + list(annot_dict['ID_string'][0].keys()) + v
        for i in annot_dict['ID_string']
        for k, v in i.items()
    ],
    columns=['subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc'],
)
output:
|   | subunit_ID | gene_ID   | start_index | end_index | strand | biotype | desc   |
|---|------------|-----------|-------------|-----------|--------|---------|--------|
| 0 | ID_string  | ID_string | attr1a      | attr1b    | attr1c | attr1d  | attr1e |
| 1 | string2    | ID_string | attr2a      | attr2b    | attr2c | attr2d  | attr2e |
| 2 | string3    | ID_string | attr3a      | attr3b    | attr3c | attr3d  | attr3e |
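Note that the expression above is tied to the single outer key 'ID_string'. To cover every gene in the dictionary, one option is to build a small frame per outer key and stack them with pd.concat. This is a rough sketch rather than a tested solution for the real GFF3 data: the function name gene_frame and the COLUMNS constant are invented here, and the sample data is shortened.

```python
import pandas as pd

COLUMNS = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']


def gene_frame(gene_id, records):
    """Build the rows for one gene (hypothetical helper, not a pandas function)."""
    rows = [
        [subunit_id, gene_id] + list(attrs)
        for record in records
        for subunit_id, attrs in record.items()
    ]
    return pd.DataFrame(rows, columns=COLUMNS)


# Shortened sample data with two outer keys
annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12': ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22': ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
    ],
}

# ignore_index=True renumbers the rows 0..n-1 across all genes
df = pd.concat(
    [gene_frame(gene_id, records) for gene_id, records in annot_dict.items()],
    ignore_index=True,
)
print(df)
```

Using the outer key directly as gene_ID also avoids relying on the first sub-dictionary always carrying the gene's own key. For a whole chromosome, collecting all rows in one list and calling pd.DataFrame once is likely faster than concatenating many small frames.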