df = pd.DataFrame.from_dict(dict_name, orient='index')
df.fillna('NaN', inplace=True)
df.to_csv('taxonomy_3.csv', index=True, header=True)
The above code handles the nested dictionary to dataframe conversion perfectly fine, but if the nested dictionary was created with the .append() or .extend() method, the output contains extraneous brackets [] and quotes '', which makes downstream analysis difficult.
For example, for a nested dictionary like this:
{'Ceratopteris richardii': {'superkingdom': ['Eukaryota'], 'kingdom': ['Viridiplantae'], 'phylum': ['Streptophyta'], 'subphylum': ['Streptophytina'], 'clade': ['Embryophyta', 'Tracheophyta', 'Euphyllophyta'], 'class': ['Polypodiopsida'], 'subclass': ['Polypodiidae'], 'order': ['Polypodiales'], 'suborder': ['Pteridineae'], 'family': ['Pteridaceae'], 'subfamily': ['Parkerioideae'], 'genus': ['Ceratopteris']}, 'Arabidopsis thaliana': {'superkingdom': ['Eukaryota'], 'kingdom': ['Viridiplantae'], 'phylum': ['Streptophyta'], 'subphylum': ['Streptophytina'], 'clade': ['Embryophyta', 'Tracheophyta', 'Euphyllophyta', 'Spermatophyta', 'Mesangiospermae', 'eudicotyledons', 'Gunneridae', 'Pentapetalae', 'rosids', 'malvids'], 'class': ['Magnoliopsida'], 'order': ['Brassicales'], 'family': ['Brassicaceae'], 'tribe': ['Camelineae'], 'genus': ['Arabidopsis']}}
created with the setup:
line = line.strip()  # remove newline character
words = line.split("\t", 1)  # split the line at the first tab
if words[0] in taxonomy[name]:  # add value if key already exists
    taxonomy[name][words[0]].append(words[1])
else:  # add key and value if key does not exist
    taxonomy[name][words[0]] = [words[1]]
When this is converted to a dataframe with pd.DataFrame.from_dict(), it creates a table that looks like this:
Column one | Column two |
---|---|
Key1 | ['Value1','Value2','value3'] |
Key2 | ['Value2','value4','value5'] |
Here each cell becomes a single lump of text, and the individual values inside it are lost as a level of data.
Something like this would be better, since it preserves that level of data:
Column one | Column two |
---|---|
Key1 | Value1,Value2,value3 |
Key2 | Value2,value4,value5 |
It seems the extraneous characters are essential delimiters that can't be avoided when updating keys, so as best I can tell that rules out extending the values without brackets or quotes.
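For what it's worth, the brackets and quotes do not appear to be stored in the data itself; they are how a Python list is rendered as text when a cell is written out. A minimal sketch with toy data (the `toy` dictionary and the variable names are made up for illustration):

```python
import io

import pandas as pd

# Toy stand-in for the taxonomy dictionary (hypothetical data).
toy = {"Key1": {"clade": ["Value1", "Value2"]}}

df = pd.DataFrame.from_dict(toy, orient="index")

# The cell holds a real Python list, not a string with brackets in it.
cell = df.loc["Key1", "clade"]
cell_is_list = isinstance(cell, list)

# to_csv() writes the text form of the list, which is where [] and '' come from.
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

So the lists can stay as they are while the dictionary is being built; only the final step needs to turn each list into a plain string.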
What would be more appropriate:
- Try to convert to dataframe from dictionary and remove extraneous characters in conversion? If so, how?
- Remove brackets and quotes with regex once the dataframe is created?
CodePudding user response:
One option is to stack the columns, join the strings, then unstack:
out = pd.DataFrame(my_data).stack().map(', '.join).unstack()
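Note that this one-liner relies on stack() dropping the missing cells before the join runs; depending on the pandas version, stack() may keep them, in which case ', '.join fails on NaN. A sketch of an element-wise variant that skips anything that is not a list (toy data; `join_if_list` is a hypothetical helper name, not a pandas function):

```python
import pandas as pd

# Toy stand-in for my_data (hypothetical values).
my_data = {
    "Species A": {"clade": ["Embryophyta", "Tracheophyta"], "tribe": ["Camelineae"]},
    "Species B": {"clade": ["Embryophyta"]},
}


def join_if_list(value):
    """Join list cells into one string; leave NaN and other values untouched."""
    return ", ".join(value) if isinstance(value, list) else value


df = pd.DataFrame(my_data)
# Series.map is applied column by column, so no stack/unstack round trip is needed.
out = df.apply(lambda col: col.map(join_if_list))
print(out)
```

Missing ranks stay as NaN in the result, so they can still be filled or dropped afterwards.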
But it's probably more efficient to modify the input dictionary in vanilla Python first and then construct the DataFrame:
for d in my_data.values():
    for k, v in d.items():
        d[k] = ', '.join(v)
out = pd.DataFrame(my_data)
Output:
              Ceratopteris richardii                    Arabidopsis thaliana
superkingdom  Eukaryota                                 Eukaryota
kingdom       Viridiplantae                             Viridiplantae
phylum        Streptophyta                              Streptophyta
subphylum     Streptophytina                            Streptophytina
clade         Embryophyta, Tracheophyta, Euphyllophyta  Embryophyta, Tracheophyta, Euphyllophyta, Sper...
class         Polypodiopsida                            Magnoliopsida
subclass      Polypodiidae                              NaN
order         Polypodiales                              Brassicales
suborder      Pteridineae                               NaN
family        Pteridaceae                               Brassicaceae
subfamily     Parkerioideae                             NaN
genus         Ceratopteris                              Arabidopsis
tribe         NaN                                       Camelineae
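The loop above overwrites the lists inside my_data. If the original lists are still needed, or if one row per species is preferred (as with orient='index' in the question), something along these lines may work. This is only a sketch with toy data; the name `flat` and the "; " separator are illustrative choices rather than part of the approach above:

```python
import io

import pandas as pd

# Toy stand-in for my_data (hypothetical values).
my_data = {
    "Species A": {"kingdom": ["Viridiplantae"], "clade": ["Embryophyta", "Tracheophyta"]},
    "Species B": {"kingdom": ["Viridiplantae"], "tribe": ["Camelineae"]},
}

# Build a new dictionary of joined strings; my_data keeps its lists.
flat = {
    species: {rank: "; ".join(names) for rank, names in ranks.items()}
    for species, ranks in my_data.items()
}

# orient="index" puts one species per row, as in the question.
out = pd.DataFrame.from_dict(flat, orient="index")

# Written to an in-memory buffer here; pass a file path to write to disk instead.
buffer = io.StringIO()
out.to_csv(buffer, na_rep="NaN")
csv_text = buffer.getvalue()
print(csv_text)
```

Joining with "; " instead of ", " keeps commas out of the cells, so the CSV writer has no need to wrap them in quotes, and na_rep takes the place of the separate fillna('NaN') step.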