I would like to get all the key values from a nested dictionary that is stored in lists. See example below:
#creating dataframe with nested dictionaries
test_dict_1={'results': [{'key': 'q1',
'value': ['1'],
'end_time': '2021-01-21',
'start_time': '2021-01-21',
'result_type': 'multipleChoice'},
{'key': 'q2',
'value': ['False'],
'end_time': '2021-01-21',
'start_time': '2021-01-21',
'result_type': 'multipleChoice'},
{'key': 'q3',
'value': ['3'],
'end_time': '2021-01-21',
'start_time': '2021-01-21',
'result_type': 'multipleChoice'},
{'key': 'q4',
'value': ['3'],
'end_time': '2021-01-21',
'start_time': '2021-01-21',
'result_type': 'multipleChoice'}]}
test_dict_2={'results': [{'key': 'survey_x',
'value': [[{'key': 'q1',
'value': 2,
'endTime': '2021-01-21',
'skipped': False,
'startTime': '2021-01-21',
'resultType': 'multipleChoice'},
{'key': 'q2',
'value': 0,
'endTime': '2021-01-21',
'skipped': False,
'startTime': '2021-01-21',
'resultType': 'multipleChoice'},
{'key': 'q3',
'value': 2,
'endTime':'2021-01-21',
'skipped': False,
'startTime': '2021-01-21',
'resultType': 'multipleChoice'},
{'key': 'q4',
'value': 0,
'endTime': '2021-01-21',
'skipped': False,
'startTime':'2021-01-21',
'resultType': 'multipleChoice'}]],
'skipped': False,
'end_time': '2021-01-21',
'start_time': '2021-01-21',
'result_type': 'grouped'}]}
df = pd.DataFrame()
df = pd.concat([df, pd.Series([test_dict_1])],ignore_index=True)
df = pd.concat([df, pd.Series([test_dict_2])],ignore_index=True)
df.head()
Below I created a for-loop that extracts all key values from all rows. Using this approach, I get the ideal key-value extraction from the first row, but not the second row. Can you help me figure out how to only extract the key-values from second row which contains a dictionary in a list?
for i in range(len(df)):
for key_, value_ in recursive_items(df.loc[i, 0]):
for element in value_:
keys_ = dict((k, element[k]) for k in ['key'] if k in element)
texts_ = dict((k, element[k]) for k in ['text'] if k in element)
values_ = dict((k, element[k]) for k in ['value'] if k in element)
#print('keys',keys_)
#print('text',texts_)
#print('values',values_)
new_data_dictionary=dict.fromkeys(keys_.values(),values_)
if bool(texts_):
new_data_dictionary.append(texts_)
print(new_data_dictionary)
df.loc[i,'key_values']=new_data_dictionary
The for-loop returns the following:
#this is what I would like
{'q1': {'value': ['1']}}
{'q2': {'value': ['False']}}
{'q3': {'value': ['3']}}
{'q4': {'value': ['3']}}
#this is not what I want. The issue is caused by the dictionary being in the list
{'survey_x': {'value': [[{'key': 'q1', 'value': 2, 'endTime': '2021-01-21', 'skipped': False, 'startTime': '2021-01-21', 'resultType': 'multipleChoice'}, {'key': 'q2', 'value': 0, 'endTime': '2021-01-21', 'skipped': False, 'startTime': '2021-01-21', 'resultType': 'multipleChoice'}, {'key': 'q3', 'value': 2, 'endTime': '2021-01-21', 'skipped': False, 'startTime': '2021-01-21', 'resultType': 'multipleChoice'}, {'key': 'q4', 'value': 0, 'endTime': '2021-01-21', 'skipped': False, 'startTime': '2021-01-21', 'resultType': 'multipleChoice'}]]}}
CodePudding user response:
This script doesn’t format the result values the way that you want, but it will help you figure out how to do it on your own:
test_dict_2={'results': [{'key': 'survey_x',
'value': [[{'key': 'q1',
'value': 2,
'endTime': '2021-01-21',
'skipped': False,
'startTime': '2021-01-21',
'resultType': 'multipleChoice'},
{'key': 'q2',
'value': 0,
'endTime': '2021-01-21',
'skipped': False,
'startTime': '2021-01-21',
'resultType': 'multipleChoice'},
{'key': 'q3',
'value': 2,
'endTime':'2021-01-21',
'skipped': False,
'startTime': '2021-01-21',
'resultType': 'multipleChoice'},
{'key': 'q4',
'value': 0,
'endTime': '2021-01-21',
'skipped': False,
'startTime':'2021-01-21',
'resultType': 'multipleChoice'}]],
'skipped': False,
'end_time': '2021-01-21',
'start_time': '2021-01-21',
'result_type': 'grouped'}]}
result_values = test_dict_2["results"][0]["value"][0]
for result_val in result_values:
# do something with each dictionary
It basically accesses the nested list of dictionaries. try replacing the comment line with print(result_val.items())
to see the output, and then I think you’ve got it from there.
CodePudding user response:
OP's issue is in the creation of the dataframe using test_dict_2
. OP wants to create dataframe from test_dict_2
, but, instead of using the key "survey_x", OP wants to go deeper into the nested dictionary.
For that, one can use pandas.DataFrame
as follows
df_2 = pd.DataFrame(test_dict_2['results'][0]['value'][0])
For the first one it is simpler, as the following will do the work
df_1 = pd.DataFrame(test_dict_1['results'])
Then, the loop should run. Also created a simple function that, given a specific dataframe, extracts the key values, just to test it
def extract_keys_values(df):
# Create a list of dictionaries
list_of_dicts = []
for index, row in df.iterrows():
# Create a dictionary for each row
dict_ = {}
for key, value in row.items():
# If the value is a list, extract the first element
if isinstance(value, list):
value = value[0]
# If the value is a dictionary, extract the value
if isinstance(value, dict):
value = value['value']
# Add key and value to dictionary
dict_[key] = value
# Add dictionary to list
list_of_dicts.append(dict_)
# Create dataframe from list of dictionaries
df = pd.DataFrame(list_of_dicts)
return df
And it works just fine. One can run it with the dataframes created above, to test, as follows
df_1 = extract_keys_values(df_1)
[Out]:
key value end_time start_time result_type
0 q1 1 2021-01-21 2021-01-21 multipleChoice
1 q2 False 2021-01-21 2021-01-21 multipleChoice
2 q3 3 2021-01-21 2021-01-21 multipleChoice
3 q4 3 2021-01-21 2021-01-21 multipleChoice
and
df_2 = extract_keys_values(df_2)
[Out]:
key value endTime skipped startTime resultType
0 q1 2 2021-01-21 False 2021-01-21 multipleChoice
1 q2 0 2021-01-21 False 2021-01-21 multipleChoice
2 q3 2 2021-01-21 False 2021-01-21 multipleChoice
3 q4 0 2021-01-21 False 2021-01-21 multipleChoice
Finally, if one wants to drop columns that one doesn't need, the following thread should be helpful: Delete a column from a Pandas DataFrame
CodePudding user response:
Two things to point out. First, your complicated for-loop setup is mostly unnecessary. If you tweak code used to create your dataframe, you can turn the list of dictionaries into rows of a data frame and access the values directly:
df1 = pd.DataFrame(test_dict_1['results'])
print(df1[['key', 'value']])
That code outputs the following:
key value
0 q1 [1]
1 q2 [False]
2 q3 [3]
3 q4 [3]
Second, the issue with the test_dict_2
is that it's structured differently. To create a dataframe from it, you'll need to make sure you correctly access the list of dictionaries:
df2 = pd.DataFrame(test_dict_2['results'][0]['value'][0])
print(df2[['key', 'value']])
Output:
key value
0 q1 2
1 q2 0
2 q3 2
3 q4 0
Note that, as-is, you'll have a hard time concatenating df1
and df2
- they have different column names and don't have the same number or order of columns. However, if you really need to concatenate the two frames together you can do so by renaming columns to match up nicely and then following the documentation here. For example, in this case you could use the following code to concatenate the data frames:
# Create dataframes
df1 = pd.DataFrame(test_dict_1['results'])
df2 = pd.DataFrame(test_dict_2['results'][0]['value'][0])
# Prepare for concatenation
df2.columns = ['end_time', 'key', 'result_type', 'skipped', 'start_time', 'value']
df2 = df2.drop(columns='skipped')
df1['value'] = [v[0] for v in df1['value']]
# Concatenate data frames
df = pd.concat([df1, df2])
print(df)
# Print all key/value pairs
print(df[['key', 'value']])