How to get values from list of nested dictionaries?-CodePudding

I would like to get all the key values from a nested dictionary that is stored in lists. See example below:

#creating dataframe with nested dictionaries
test_dict_1={'results': [{'key': 'q1',
   'value': ['1'],
   'end_time': '2021-01-21',
   'start_time': '2021-01-21',
   'result_type': 'multipleChoice'},
  {'key': 'q2',
   'value': ['False'],
   'end_time': '2021-01-21',
   'start_time': '2021-01-21',
   'result_type': 'multipleChoice'},
  {'key': 'q3',
   'value': ['3'],
   'end_time': '2021-01-21',
   'start_time': '2021-01-21',
   'result_type': 'multipleChoice'},
  {'key': 'q4',
   'value': ['3'],
   'end_time': '2021-01-21',
   'start_time': '2021-01-21',
   'result_type': 'multipleChoice'}]}
  


test_dict_2={'results': [{'key': 'survey_x',
   'value': [[{'key': 'q1',
      'value': 2,
      'endTime': '2021-01-21',
      'skipped': False,
      'startTime': '2021-01-21',
      'resultType': 'multipleChoice'},
     {'key': 'q2',
      'value': 0,
      'endTime': '2021-01-21',
      'skipped': False,
      'startTime': '2021-01-21',
      'resultType': 'multipleChoice'},
     {'key': 'q3',
      'value': 2,
      'endTime':'2021-01-21',
      'skipped': False,
      'startTime': '2021-01-21',
      'resultType': 'multipleChoice'},
     {'key': 'q4',
      'value': 0,
      'endTime': '2021-01-21',
      'skipped': False,
      'startTime':'2021-01-21',
      'resultType': 'multipleChoice'}]],
   'skipped': False,
   'end_time': '2021-01-21',
   'start_time': '2021-01-21',
   'result_type': 'grouped'}]}


df = pd.DataFrame()
df = pd.concat([df, pd.Series([test_dict_1])],ignore_index=True)
df = pd.concat([df, pd.Series([test_dict_2])],ignore_index=True)

df.head()

Below I created a for-loop that extracts all key values from all rows. Using this approach, I get the ideal key-value extraction from the first row, but not the second row. Can you help me figure out how to only extract the key-values from second row which contains a dictionary in a list?

for i in range(len(df)):
    for key_, value_ in recursive_items(df.loc[i, 0]):
        for element in value_:
            keys_ = dict((k, element[k]) for k in ['key'] if k in element)
            texts_ = dict((k, element[k]) for k in ['text'] if k in element)
            values_ = dict((k, element[k]) for k in ['value'] if k in element)
            
            #print('keys',keys_)
            #print('text',texts_)
            #print('values',values_)

            new_data_dictionary=dict.fromkeys(keys_.values(),values_)
            if bool(texts_):
                new_data_dictionary.append(texts_)
            print(new_data_dictionary)
            df.loc[i,'key_values']=new_data_dictionary

The for-loop returns the following:

#this is what I would like
{'q1': {'value': ['1']}}
{'q2': {'value': ['False']}}
{'q3': {'value': ['3']}}
{'q4': {'value': ['3']}}

#this is not what I want. The issue is caused by the dictionary being in the list
{'survey_x': {'value': [[{'key': 'q1', 'value': 2, 'endTime': '2021-01-21', 'skipped': False, 'startTime': '2021-01-21', 'resultType': 'multipleChoice'}, {'key': 'q2', 'value': 0, 'endTime': '2021-01-21', 'skipped': False, 'startTime': '2021-01-21', 'resultType': 'multipleChoice'}, {'key': 'q3', 'value': 2, 'endTime': '2021-01-21', 'skipped': False, 'startTime': '2021-01-21', 'resultType': 'multipleChoice'}, {'key': 'q4', 'value': 0, 'endTime': '2021-01-21', 'skipped': False, 'startTime': '2021-01-21', 'resultType': 'multipleChoice'}]]}}

CodePudding user response：

This script doesn’t format the result values the way that you want, but it will help you figure out how to do it on your own:

test_dict_2={'results': [{'key': 'survey_x',
   'value': [[{'key': 'q1',
      'value': 2,
      'endTime': '2021-01-21',
      'skipped': False,
      'startTime': '2021-01-21',
      'resultType': 'multipleChoice'},
     {'key': 'q2',
      'value': 0,
      'endTime': '2021-01-21',
      'skipped': False,
      'startTime': '2021-01-21',
      'resultType': 'multipleChoice'},
     {'key': 'q3',
      'value': 2,
      'endTime':'2021-01-21',
      'skipped': False,
      'startTime': '2021-01-21',
      'resultType': 'multipleChoice'},
     {'key': 'q4',
      'value': 0,
      'endTime': '2021-01-21',
      'skipped': False,
      'startTime':'2021-01-21',
      'resultType': 'multipleChoice'}]],
   'skipped': False,
   'end_time': '2021-01-21',
   'start_time': '2021-01-21',
   'result_type': 'grouped'}]}

result_values = test_dict_2["results"][0]["value"][0]
for result_val in result_values:
    # do something with each dictionary

It basically accesses the nested list of dictionaries. try replacing the comment line with print(result_val.items()) to see the output, and then I think you’ve got it from there.

CodePudding user response：

OP's issue is in the creation of the dataframe using test_dict_2. OP wants to create dataframe from test_dict_2, but, instead of using the key "survey_x", OP wants to go deeper into the nested dictionary.

For that, one can use pandas.DataFrame as follows

df_2 = pd.DataFrame(test_dict_2['results'][0]['value'][0])

For the first one it is simpler, as the following will do the work

df_1 = pd.DataFrame(test_dict_1['results'])

Then, the loop should run. Also created a simple function that, given a specific dataframe, extracts the key values, just to test it

def extract_keys_values(df):
    # Create a list of dictionaries
    list_of_dicts = []
    for index, row in df.iterrows():
        # Create a dictionary for each row
        dict_ = {}
        for key, value in row.items():
            # If the value is a list, extract the first element
            if isinstance(value, list):
                value = value[0]
            # If the value is a dictionary, extract the value
            if isinstance(value, dict):
                value = value['value']
            # Add key and value to dictionary
            dict_[key] = value
        # Add dictionary to list
        list_of_dicts.append(dict_)
    # Create dataframe from list of dictionaries
    df = pd.DataFrame(list_of_dicts)
    return df

And it works just fine. One can run it with the dataframes created above, to test, as follows

df_1 = extract_keys_values(df_1)

[Out]:
  key  value    end_time  start_time     result_type
0  q1      1  2021-01-21  2021-01-21  multipleChoice
1  q2  False  2021-01-21  2021-01-21  multipleChoice
2  q3      3  2021-01-21  2021-01-21  multipleChoice
3  q4      3  2021-01-21  2021-01-21  multipleChoice

and

df_2 = extract_keys_values(df_2)

[Out]:
  key  value     endTime  skipped   startTime      resultType
0  q1      2  2021-01-21    False  2021-01-21  multipleChoice
1  q2      0  2021-01-21    False  2021-01-21  multipleChoice
2  q3      2  2021-01-21    False  2021-01-21  multipleChoice
3  q4      0  2021-01-21    False  2021-01-21  multipleChoice

Finally, if one wants to drop columns that one doesn't need, the following thread should be helpful: Delete a column from a Pandas DataFrame

CodePudding user response：

Two things to point out. First, your complicated for-loop setup is mostly unnecessary. If you tweak code used to create your dataframe, you can turn the list of dictionaries into rows of a data frame and access the values directly:

df1 = pd.DataFrame(test_dict_1['results'])
print(df1[['key', 'value']])

That code outputs the following:

  key    value
0  q1      [1]
1  q2  [False]
2  q3      [3]
3  q4      [3]

Second, the issue with the test_dict_2 is that it's structured differently. To create a dataframe from it, you'll need to make sure you correctly access the list of dictionaries:

df2 = pd.DataFrame(test_dict_2['results'][0]['value'][0])
print(df2[['key', 'value']])

Output:

  key  value
0  q1      2
1  q2      0
2  q3      2
3  q4      0

Note that, as-is, you'll have a hard time concatenating df1 and df2 - they have different column names and don't have the same number or order of columns. However, if you really need to concatenate the two frames together you can do so by renaming columns to match up nicely and then following the documentation here. For example, in this case you could use the following code to concatenate the data frames:

# Create dataframes
df1 = pd.DataFrame(test_dict_1['results'])
df2 = pd.DataFrame(test_dict_2['results'][0]['value'][0])

# Prepare for concatenation
df2.columns = ['end_time', 'key', 'result_type', 'skipped', 'start_time', 'value']
df2 = df2.drop(columns='skipped')
df1['value'] = [v[0] for v in df1['value']]

# Concatenate data frames
df = pd.concat([df1, df2])
print(df)

# Print all key/value pairs
print(df[['key', 'value']])