The problem
I am facing a problem: I am managing a dataset in which each entry has an associated dictionary of the form
dictionary = {
    'Step_1': {
        'Q=123',
        'W=456',
        'E=789'
    },
    'Step_2': {
        'Q=753',
        'W=159',
        'E=888'
    }
}
Please note that the dicts have a variable number of Steps.
So I am organising the data into a pandas df like:
dicts
0 {'Step_1': {'Q': '123', 'W': '456', ...
1 {'Step_1': {'Q': '123', 'W': '456', ...
2 {'Step_1': {'Q': '123', 'W': '456', ...
and would now like to do some row-wise operations, like getting each dict['Step_1']['Q'] value.
I know that it's generally suggested to not work with dicts as df values, so I'd like to use a good, pythonic (read: fast) solution.
How would you proceed to get each dict['Step_1']['Q'] row-wise?
What I tried
A simple solution that came to my mind was:
df[dicts]['Step_1']['Q'], but it doesn't seem to work. (Why? Might it be because pandas doesn't "unpack" the row values this way, and hence cannot access the dicts?)
A more complex solution that I found to work is to use a function to access the data, as follows:
def access(x):
    return x["Step_1"]["Q"]

df['new_col'] = df['dicts'].apply(access)
but I don't quite like this solution. As far as I know, using the apply method is not the optimal way to tackle the problem.
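(For reference, a minimal apply-free sketch, assuming the dicts column really holds nested dicts exactly as in the frame shown above: flatten the column once with pd.json_normalize and read the flattened columns directly. Rows that are missing a given step would simply get NaN in those columns.)
import pandas as pd

# Sketch: flatten the nested dicts once into ordinary columns ('Step_1.Q', 'Step_1.W', ...).
# Assumes df['dicts'] holds nested dicts as shown above; missing steps become NaN.
flat = pd.json_normalize(df['dicts'].tolist())
flat.index = df.index          # keep the original row labels

step1_q = flat['Step_1.Q']     # the dict['Step_1']['Q'] value for every row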
CodePudding user response:
One idea that comes to my mind is to use multi-indexing to unpack the dictionaries that are saved in the dataframe right now. Say there are two dictionaries:
dictionary1 = {
    'Step_1': {
        'Q': 123,
        'W': 456,
        'E': 789
    },
    'Step_2': {
        'Q': 753,
        'W': 159,
        'E': 888
    }
}
dictionary2 = {
    'Step_1': {
        'Q': 789,
        'W': 456,
        'E': 123
    },
    'Step_2': {
        'Q': 357,
        'W': 951,
        'E': 888
    }
}
Then you could create a dataframe from the dictionaries:
df = pd.DataFrame.from_dict([dictionary1, dictionary2])
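At this point every cell of df still holds one of the inner dicts, for example:
df.loc[0, 'Step_1']
# {'Q': 123, 'W': 456, 'E': 789}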
Then, inspired by the thread "Nested dictionary to multiindex dataframe where dictionary keys are column labels", you could build a MultiIndex to unpack the nested dictionaries:
multituple = {
    (outerKey, innerKey): values
    for outerKey, innerDict in df.items()       # outerKey: 'Step_1'/'Step_2', innerDict: that column as a Series
    for innerKey, values in innerDict.items()   # innerKey: row index (0, 1), values: the inner dict
}
This results in Q, W, and E being an index and not a column, but if you have a varying number of steps per dictionary, this might actually be useful since you would not face the issue with varying column lengths.
>>> pd.DataFrame(multituple)
  Step_1      Step_2
       0    1      0    1
Q    123  789    753  357
W    456  456    159  951
E    789  123    888  888
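From this reshaped frame, the row-wise lookup from the question (every dict['Step_1']['Q']) becomes a plain selection, for example:
wide = pd.DataFrame(multituple)

# Select the 'Step_1' group of columns, then the 'Q' row:
# one value per original dictionary (here columns 0 and 1 -> 123 and 789).
step1_q = wide['Step_1'].loc['Q']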
Hope this gives you an idea to explore.
CodePudding user response:
I think you should reshape your dataset. Check this out:
# Let's say we have this
dictionary = {
    "Step_1": {"Q=123", "W=456", "E=789"},
    "Step_2": {"Q=753", "W=159", "E=888"},
}
dicts = [dictionary, dictionary]  # This would be your dataset
Do this:
better = []
for i, d in enumerate(dicts):  # d is a dictionary
    # Iterate over the keys and values of the dictionary
    for k, v in d.items():
        # Get the step from the key
        step = k.split("_")[1]
        # Separate the key and value (Q=123, W=456, E=789)
        new_d = {x.split("=")[0]: x.split("=")[1] for x in v}
        # Add this step to the new list
        better.append({"id": i, "step": step, **new_d})
This is what better is now:
[{'id': 0, 'step': '1', 'W': '456', 'Q': '123', 'E': '789'},
{'id': 0, 'step': '2', 'W': '159', 'E': '888', 'Q': '753'},
{'id': 1, 'step': '1', 'W': '456', 'Q': '123', 'E': '789'},
{'id': 1, 'step': '2', 'W': '159', 'E': '888', 'Q': '753'}]
Now you can rebuild your DataFrame and perform all kinds of row-wise operations:
df = pd.DataFrame(better)
   id step    W    Q    E
0   0    1  456  123  789
1   0    2  159  753  888
2   1    1  456  123  789
3   1    2  159  753  888
For example, get all of the "Step 1 W" values:
df[df.step == "1"].W
# Output:
0 456
2 456
Name: W, dtype: object
Note: it's probably a good idea to convert the columns to ints; right now they are stored as strs.
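For example, one quick way to do that conversion on the columns from the frame above:
# Cast the string columns produced by the reshape to integers
df[["step", "Q", "W", "E"]] = df[["step", "Q", "W", "E"]].astype(int)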