How to remove brackets from multi-value keys when converting to dataframes or extend values of a key

import pandas as pd

# dict_name is the nested dictionary to convert
df = pd.DataFrame.from_dict(dict_name, orient='index')
df.fillna('NaN', inplace=True)
df.to_csv('taxonomy_3.csv', index=True, header=True)

The code above converts a nested dictionary to a DataFrame perfectly well. But if the inner values of the nested dictionary were built with the .append() or .extend() method, each value is a list, and the output picks up extraneous brackets [] and quotes '', which makes downstream analysis difficult.

For example, for a nested dictionary like this:

{'Ceratopteris richardii': {'superkingdom': ['Eukaryota'], 'kingdom': ['Viridiplantae'], 'phylum': ['Streptophyta'], 'subphylum': ['Streptophytina'], 'clade': ['Embryophyta', 'Tracheophyta', 'Euphyllophyta'], 'class': ['Polypodiopsida'], 'subclass': ['Polypodiidae'], 'order': ['Polypodiales'], 'suborder': ['Pteridineae'], 'family': ['Pteridaceae'], 'subfamily': ['Parkerioideae'], 'genus': ['Ceratopteris']}, 'Arabidopsis thaliana': {'superkingdom': ['Eukaryota'], 'kingdom': ['Viridiplantae'], 'phylum': ['Streptophyta'], 'subphylum': ['Streptophytina'], 'clade': ['Embryophyta', 'Tracheophyta', 'Euphyllophyta', 'Spermatophyta', 'Mesangiospermae', 'eudicotyledons', 'Gunneridae', 'Pentapetalae', 'rosids', 'malvids'], 'class': ['Magnoliopsida'], 'order': ['Brassicales'], 'family': ['Brassicaceae'], 'tribe': ['Camelineae'], 'genus': ['Arabidopsis']}}

created with the setup:

line = line.strip()  # remove newline character
words = line.split("\t", 1)  # split the line at the first tab
if words[0] in taxonomy[name]:  # add value if key already exists
    taxonomy[name][words[0]].append(words[1])
else:  # add key and value if key does not exist
    taxonomy[name][words[0]] = [words[1]]

and converted to a DataFrame with pd.DataFrame.from_dict(), the result is a table that looks like this:

Column one	Column two
Key1	['Value1','Value2','value3']
Key2	['Value2','value4','value5']

Here each cell is written out as a single lump of text, brackets and quotes included, so a level of the data is lost.
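To see where the brackets come from, here is a minimal sketch using the placeholder keys and values from the table above (the `taxonomy` dictionary below is made up for illustration). It suggests that each cell still holds a real Python list inside the DataFrame, and that the brackets and quotes only appear when that list is converted to text for the CSV.

```python
import io

import pandas as pd

# Placeholder data mirroring the example table (illustrative only)
taxonomy = {
    "Key1": {"Column two": ["Value1", "Value2", "value3"]},
    "Key2": {"Column two": ["Value2", "value4", "value5"]},
}

df = pd.DataFrame.from_dict(taxonomy, orient="index")

# Inside the DataFrame the cell is still a list, not a string
cell = df.loc["Key1", "Column two"]
print(type(cell))

# Writing to CSV stringifies the list, which is where [] and '' come from
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

So the brackets are not stored in the data itself; they are the text form of a list, and joining the list into one string before writing avoids them.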

Something like this would be better, since it preserves that level of data:

Column one	Column two
Key1	Value1,Value2,value3
Key2	Value2,value4,value5

The brackets and quotes seem to be essential delimiters that I can't do without when updating keys, so as far as I can tell that rules out extending the values without them.

Which would be more appropriate:

Convert the dictionary to a DataFrame and remove the extraneous characters during the conversion? If so, how?
Remove the brackets and quotes with a regex once the DataFrame is created?

Answer:

One option is to stack the columns, join the strings, then unstack:

out = pd.DataFrame(my_data).stack().map(', '.join).unstack()
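If the DataFrame already exists, a similar result can be had by joining cell by cell instead of stacking. The sketch below assumes every non-missing cell is a list of strings; `join_if_list` is a helper name made up here, and `my_data` is a trimmed-down copy of the dictionary from the question. Missing cells are passed through unchanged.

```python
import pandas as pd

# Trimmed-down version of the question's dictionary (illustrative subset)
my_data = {
    "Ceratopteris richardii": {
        "clade": ["Embryophyta", "Tracheophyta", "Euphyllophyta"],
        "subclass": ["Polypodiidae"],
    },
    "Arabidopsis thaliana": {
        "clade": ["Embryophyta", "Tracheophyta"],
        "tribe": ["Camelineae"],
    },
}


def join_if_list(value):
    """Join a list of strings; leave anything else (e.g. NaN) untouched."""
    return ", ".join(value) if isinstance(value, list) else value


df = pd.DataFrame(my_data)

# Apply the helper to every cell, one column at a time
out = df.apply(lambda column: column.map(join_if_list))
print(out)
```

No regex is needed with this approach, because the cells are joined while they are still lists rather than after they have been turned into text.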

But it's probably more efficient to modify the input dictionary in vanilla Python first and then construct the DataFrame:

# Replace each list with one comma-separated string (modifies my_data in place)
for d in my_data.values():
    for k, v in d.items():
        d[k] = ', '.join(v)
out = pd.DataFrame(my_data)

Output:

                                Ceratopteris richardii                               Arabidopsis thaliana
superkingdom                                 Eukaryota                                          Eukaryota
kingdom                                  Viridiplantae                                      Viridiplantae
phylum                                    Streptophyta                                       Streptophyta
subphylum                               Streptophytina                                     Streptophytina
clade         Embryophyta, Tracheophyta, Euphyllophyta  Embryophyta, Tracheophyta, Euphyllophyta, Sper...
class                                   Polypodiopsida                                      Magnoliopsida
subclass                                  Polypodiidae                                                NaN
order                                     Polypodiales                                        Brassicales
suborder                                   Pteridineae                                                NaN
family                                     Pteridaceae                                       Brassicaceae
subfamily                                Parkerioideae                                                NaN
genus                                     Ceratopteris                                        Arabidopsis
tribe                                              NaN                                         Camelineae
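For the layout in the question (one row per species, written to CSV), the same joining idea can be sketched without modifying the original dictionary. This is only an illustration under stated assumptions: `taxonomy` is a shortened copy of the question's data, `joined` is a name introduced here, and `na_rep` is used as a stand-in for the question's `fillna('NaN')` step.

```python
import pandas as pd

# Shortened copy of the question's dictionary (illustrative subset)
taxonomy = {
    "Ceratopteris richardii": {
        "kingdom": ["Viridiplantae"],
        "clade": ["Embryophyta", "Tracheophyta", "Euphyllophyta"],
        "subclass": ["Polypodiidae"],
    },
    "Arabidopsis thaliana": {
        "kingdom": ["Viridiplantae"],
        "clade": ["Embryophyta", "Tracheophyta"],
        "tribe": ["Camelineae"],
    },
}

# Build a new dictionary with joined strings; taxonomy keeps its lists
joined = {
    name: {rank: ", ".join(values) for rank, values in ranks.items()}
    for name, ranks in taxonomy.items()
}

# orient="index" puts one species per row, as in the question
df = pd.DataFrame.from_dict(joined, orient="index")

# With no path given, to_csv returns the CSV text; na_rep fills missing ranks
csv_text = df.to_csv(na_rep="NaN")
print(csv_text)
```

Keeping the lists in `taxonomy` means further values can still be appended later, while `joined` is rebuilt whenever a fresh table is needed.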