The problem
I am facing a problem: I am managing a dataset in which each entry has an associated dictionary of the form
dictionary = {
    'Step_1': {
        'Q=123',
        'W=456',
        'E=789'
    },
    'Step_2': {
        'Q=753',
        'W=159',
        'E=888'
    }
}
Please note that the dicts have a variable number of Steps.
So I am organising the data into a pandas df like:
dicts
0 {'Step_1': {'Q': '123', 'W': '456', ...
1 {'Step_1': {'Q': '123', 'W': '456', ...
2 {'Step_1': {'Q': '123', 'W': '456', ...
and would now like to do some row-wise operations, like getting each dict['Step_1']['Q'] value.
I know that it's generally suggested to not work with dicts as df values, so I'd like to use a good, pythonic (read: fast) solution.
How would you proceed to get each dict['Step_1']['Q'] row-wise?
What I tried
A simple solution that came to my mind was:
df[dicts]['Step_1']['Q'], but it doesn't seem to work. (Why? Might it be because pandas doesn't "unpack" the row values this way, and hence cannot access the dicts?)
A more complex solution that I found to work is to use a function to access the data, as follows:
def access(x):
    return x["Step_1"]["Q"]

df['new_col'] = df['dicts'].apply(access)
but I don't quite like this solution. As far as I know, using the apply method is not the optimal way to tackle the problem.
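(For reference, a minimal apply-free sketch, assuming the dicts column really holds nested dicts exactly as in the frame shown above: flatten the column once with pd.json_normalize and read the flattened columns directly. Rows that are missing a given step would simply get NaN in those columns.)
import pandas as pd

# Sketch: flatten the nested dicts once into ordinary columns ('Step_1.Q', 'Step_1.W', ...).
# Assumes df['dicts'] holds nested dicts as shown above; missing steps become NaN.
flat = pd.json_normalize(df['dicts'].tolist())
flat.index = df.index          # keep the original row labels

step1_q = flat['Step_1.Q']     # the dict['Step_1']['Q'] value for every row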
CodePudding user response:
One idea that comes to my mind is to use multi-indexing to unpack the dictionaries that are saved in the dataframe right now. Say there are two dictionaries:
dictionary1 = {
    'Step_1': {
        'Q': 123,
        'W': 456,
        'E': 789
    },
    'Step_2': {
        'Q': 753,
        'W': 159,
        'E': 888
    }
}
dictionary2 = {
    'Step_1': {
        'Q': 789,
        'W': 456,
        'E': 123
    },
    'Step_2': {
        'Q': 357,
        'W': 951,
        'E': 888
    }
}
Then you could create a dataframe from the dictionaries:
df = pd.DataFrame.from_dict([dictionary1, dictionary2])
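At this point every cell of df still holds one of the inner dicts, for example:
df.loc[0, 'Step_1']
# {'Q': 123, 'W': 456, 'E': 789}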
Then, inspired by the thread "Nested dictionary to multiindex dataframe where dictionary keys are column labels", you could build a MultiIndex to unpack the nested dictionaries:
multituple = {
    (outerKey, innerKey): values
    for outerKey, innerDict in df.items()       # outerKey: 'Step_1'/'Step_2', innerDict: that column as a Series
    for innerKey, values in innerDict.items()   # innerKey: row index (0, 1), values: the inner dict
}
This results in Q, W, and E being an index and not a column, but if you have a varying number of steps per dictionary, this might actually be useful since you would not face the issue with varying column lengths.
>>> pd.DataFrame(multituple)
  Step_1      Step_2
       0    1      0    1
Q    123  789    753  357
W    456  456    159  951
E    789  123    888  888
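From this reshaped frame, the row-wise lookup from the question (every dict['Step_1']['Q']) becomes a plain selection, for example:
wide = pd.DataFrame(multituple)

# Select the 'Step_1' group of columns, then the 'Q' row:
# one value per original dictionary (here columns 0 and 1 -> 123 and 789).
step1_q = wide['Step_1'].loc['Q']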
Hope this gives you an idea to explore.
CodePudding user response:
I think you should reshape your dataset. Check this out:
# Let's say we have this
dictionary = {
    "Step_1": {"Q=123", "W=456", "E=789"},
    "Step_2": {"Q=753", "W=159", "E=888"},
}
dicts = [dictionary, dictionary]  # This would be your dataset
Do this:
better = []
for i, d in enumerate(dicts):  # d is a dictionary
    # Iterate over the keys and values of the dictionary
    for k, v in d.items():
        # Get the step from the key
        step = k.split("_")[1]
        # Separate the key and value (Q=123, W=456, E=789)
        new_d = {x.split("=")[0]: x.split("=")[1] for x in v}
        # Add this step to the new list
        better.append({"id": i, "step": step, **new_d})
This is what better is now:
[{'id': 0, 'step': '1', 'W': '456', 'Q': '123', 'E': '789'},
{'id': 0, 'step': '2', 'W': '159', 'E': '888', 'Q': '753'},
{'id': 1, 'step': '1', 'W': '456', 'Q': '123', 'E': '789'},
{'id': 1, 'step': '2', 'W': '159', 'E': '888', 'Q': '753'}]
Now you can rebuild your DataFrame and perform all kinds of row-wise operations:
df = pd.DataFrame(better)
   id step    W    Q    E
0   0    1  456  123  789
1   0    2  159  753  888
2   1    1  456  123  789
3   1    2  159  753  888
For example, get all of the "Step 1 W" values:
df[df.step == "1"].W
# Output:
0 456
2 456
Name: W, dtype: object
Note: it's probably a good idea to convert the columns to ints; right now they are stored as strs.
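For example, one quick way to do that conversion on the columns from the frame above:
# Cast the string columns produced by the reshape to integers
df[["step", "Q", "W", "E"]] = df[["step", "Q", "W", "E"]].astype(int)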