Construct a Pandas dataframe from list of nested dictionaries containing some missing keys/values-CodePudding

I'm using an API to retrieve a list of nested dictionaries containing data from an online survey. The data looks as follows:

filtered_messages = [{'type': 'message',
  'subtype': 'bot_message',
  'text': "This content can't be displayed.",
  'ts': '1654800705.715459',
  'username': 'Typeform',
  'icons': {'image_48': 'www.example.com'},
  'bot_id': 'ABC4K1SJE',
  'app_id': 'DEFSC3RT6',
  'blocks': [{'type': 'section',
    'block_id': 'GUoS',
    'text': {'type': 'mrkdwn',
     'text': 'You have a new message.',
     'verbatim': False}},
   {'type': 'section',
    'block_id': 'abZ',
    'text': {'type': 'mrkdwn',
     'text': '*Where did you first hear about us?*\nAdvertisement',
     'verbatim': False}},
   {'type': 'section',
    'block_id': 'lEd',
    'text': {'type': 'mrkdwn',
     'text': '*Reason for purchase: *\nLeisure',
     'verbatim': False}},
   {'type': 'section',
    'block_id': '4=vEY',
    'text': {'type': 'mrkdwn',
     'text': '*Where do you reside? *\nNew York',
     'verbatim': False}},
   {'type': 'section',
    'block_id': 'P1SKA',
    'text': {'type': 'mrkdwn',
     'text': '*Are you purchasing online or in store? *\nIn Store',
     'verbatim': False}},
   {'type': 'section',
    'block_id': 'f74v',
    'text': {'type': 'mrkdwn',
     'text': '*What was the location of your purchase? *\nNew York / City',
     'verbatim': False}},
   {'type': 'section',
    'block_id': '3tr6',
    'text': {'type': 'mrkdwn',
     'text': '*Would you be open to being interviewed by our team? *\nYes',
     'verbatim': False}},
   {'type': 'section',
    'block_id': 'aGz',
    'text': {'type': 'mrkdwn',
     'text': '*Any other feedback/questions for us? *\nYou guys are amazing',
     'verbatim': False}},
   {'type': 'section',
    'block_id': '3A4Y',
    'text': {'type': 'mrkdwn',
     'text': '*Order Number:*\n138405',
     'verbatim': False}},
   {'type': 'actions',
    'block_id': 'KNSXk',
    'elements': [{'type': 'button',
      'action_id': '4nR',
      'text': {'type': 'plain_text', 'text': 'View results', 'emoji': True},
      'url': 'www.example.com'}]},
   {'type': 'section',
    'block_id': 'kSjX4',
    'text': {'type': 'mrkdwn', 'text': ' ', 'verbatim': False}}]},
 {'type': 'message',
  'subtype': 'bot_message',
  'text': "This content can't be displayed.",
  'ts': '1654457958.167459',
  'username': 'Typeform',
  'icons': {'image_48': 'www.example.com'},
  'bot_id': 'BPD4K3SJW',
  'app_id': 'AD6SC3RT6',
  'blocks': [{'type': 'section',
    'block_id': 'Y8w',
    'text': {'type': 'mrkdwn',
     'text': 'You have a new message.',
     'verbatim': False}},
   {'type': 'section',
    'block_id': '9DGN',
    'text': {'type': 'mrkdwn',
     'text': '*Where did you first hear about us?*\nFriend',
     'verbatim': False}},
   {'type': 'section',
    'block_id': '=JF',
    'text': {'type': 'mrkdwn',
     'text': '*Reason for purchase: *\nBusiness',
     'verbatim': False}},
   {'type': 'section',
    'block_id': '3FY/t',
    'text': {'type': 'mrkdwn',
     'text': '*Where do you reside? *\nScotland',
     'verbatim': False}},
   {'type': 'section',
    'block_id': '8yz',
    'text': {'type': 'mrkdwn',
     'text': '*Are you purchasing online or in store? *\nOnline',
     'verbatim': False}},
   {'type': 'section',
    'block_id': 'PSm',
    'text': {'type': 'mrkdwn',
     'text': '*What was the location of your purchase? *\nLondon',
     'verbatim': False}},
   {'type': 'section',
    'block_id': 'fytG',
    'text': {'type': 'mrkdwn',
     'text': '*Would you be open to being interviewed by our team? *\nNo',
     'verbatim': False}},
   {'type': 'section',
    'block_id': 'S8LaL',
    'text': {'type': 'mrkdwn',
     'text': '*Order Number:*\n140829',
     'verbatim': False}},
   {'type': 'actions',
    'block_id': 'ZON',
    'elements': [{'type': 'button',
      'action_id': '02eV',
      'text': {'type': 'plain_text', 'text': 'View results', 'emoji': True},
      'url': 'www.example.com'}]},
   {'type': 'section',
    'block_id': 'zeVIW',
    'text': {'type': 'mrkdwn', 'text': ' ', 'verbatim': False}}]}]

I would like to create a Pandas dataframe that looks as follows:

ts                   heard_about_us   reason_for_purchase   customer_location   purchase_modality       purchase_location    agree_to_interview   comments                order_number
1654800705.715459    Advertisement    Leisure               New York            In Store                New York / City      Yes                  You guys are amazing    138405
1654457958.167459    Friend           Business              Scotland            Online                  London               No                                           140829

The column names align with the questions inside the text element (enclosed in asterisks '*').

To do so, I'm using the following code:

survey_data = pd.DataFrame(data=[[re.sub(".*[*]\\W ", "", val['text']['text']) for val in dat['blocks'] if val.get('text')][1:9] for dat in filtered_messages],
                 columns=['heard_about_us', 'reason_for_purchase', 'customer_location', 'purchase_modality', 'purchase_location', 'agree_to_interview', 'comments', 'order_number']
                  )

The result is the following:

ts                   heard_about_us   reason_for_purchase   customer_location   purchase_modality       purchase_location    agree_to_interview   comments                order_number
1654800705.715459    Advertisement    Leisure               New York            In Store                New York / City      Yes                  You guys are amazing    138405
1654457958.167459    Friend           Business              Scotland            Online                  London               No                   140829

Notice that the order ID 140829 is shifted to the left because there is no comments field in that survey record (because some of the survey questions are not mandatory).

To overcome this, I have tried to use pd.json.normalize() to create the df:

df = pd.json_normalize(filtered_messages,
                        record_path = ['blocks'],
                        meta = ['type', 'text'],
                        errors = 'ignore',
                        record_prefix = '_'
                       )

But, this results in a df with a _text.text column containing rows representing each of the elements that I'd like to see expressed as columns. Here is a small sample of the output:

     type      _block_id    _text.type  _text.text                                          _text.verbatim    _elements    type     text
0    section   GUoS         mrkdwn      You have a new message.                             False             NaN          message  This content can't be displayed.
1    section   abZ          mrkdwn      *Where did you first hear about us?*\nAdvertis...   False             NaN          message  This content can't be displayed.
2    section   lEd          mrkdwn      *Reason for purchase: *\nLeisure                    False             NaN          message  This content can't be displayed.
3    section   4=vEY        mrkdwn      *Where do you reside? *\nNew York                   False             NaN          message  This content can't be displayed.
4    section   P1SKA        mrkdwn      *Are you purchasing online or in store? *\nIn ...   False             NaN          message  This content can't be displayed.
5    section   f74v         mrkdwn      *What was the location of your purchase? *\nNe...   False             NaN          message  This content can't be displayed.
6    section   3tr6         mrkdwn      *Would you be open to being interviewed by our...   False             NaN          message  This content can't be displayed.
7    section   aGz          mrkdwn      *Any other feedback/questions for us? *\nYou g...   False             NaN          message  This content can't be displayed.
8    section   3A4Y         mrkdwn      *Order Number:*\n138405                             False             NaN          message  This content can't be displayed.
9    actions   KNSXk        NaN         NaN                                                 NaN               [{'type': 'button', 'action_id': '4nR', 'text'... message This content can't be displayed

How would I access the required elements inside of blocks to create the df? Potentially using the questions enclosed inside the asterisks (*) as the keys of a dictionary? Thanks!

CodePudding user response：

You can use:

# Mapping
mapping = {
    'Where did you first hear about us?': 'heard_about_us',
    'Reason for purchase:': 'reason_for_purchase',
    'Where do you reside?': 'customer_location',
    'Are you purchasing online or in store?': 'purchase_modality',
    'What was the location of your purchase?': 'purchase_location',
    'Would you be open to being interviewed by our team?': 'agree_to_interview',
    'Any other feedback/questions for us?': 'comments',
    'Order Number:': 'order_number'
}

# Use your options here but don't forget 'ts' meta
df = pd.json_normalize(filtered_messages, record_path=['blocks'], 
                       meta=['ts', 'type', 'text'], record_prefix='_')

# Extract question/answer
df1 = df['_text.text'].str.extract(r'\*(?P<question>[^\*] )\*\s (?P<answer>.*)')

# Create final dataframe
out = pd.concat([df['ts'], df1], axis=1).dropna()
out['question'] = out['question'].str.strip().replace(mapping)

# Reshape it
out = (out.pivot('ts', 'question', 'answer')[mapping.values()]
          .reset_index().rename_axis(columns=None))

Output:

>>> out
                  ts heard_about_us reason_for_purchase customer_location purchase_modality purchase_location agree_to_interview              comments order_number
0  1654457958.167459         Friend            Business          Scotland            Online            London                 No                   NaN       140829
1  1654800705.715459  Advertisement             Leisure          New York          In Store   New York / City                Yes  You guys are amazing       138405

CodePudding user response：

df = pd.json_normalize(filtered_messages, ['blocks'], 'ts')[['ts', 'text.text']]
df.columns = ['ts', 'text']
df.text.replace(' ', np.nan, inplace=True)
df.text.replace('You have a new message.', np.nan, inplace=True)
df.dropna(inplace=True)
df[['question', 'answer']] = df.apply(lambda x: x.text.strip('*').split('*'), axis=1, result_type='expand')
df = df.pivot(index='ts', columns='question', values='answer')
df.columns = ['comments', 'purchase_modality', 'order_number', 'reason_for_purchase', 'purchase_location', 'heard_about_us', 'customer_location', 'agree_to_interview']
df = df.applymap(lambda x: x.strip().strip('\\'), na_action='ignore')
df.reset_index(inplace=True)

Output:

                  ts              comments purchase_modality order_number reason_for_purchase purchase_location heard_about_us customer_location agree_to_interview
0  1654457958.167459                   NaN            Online       140829            Business            London         Friend          Scotland                 No
1  1654800705.715459  You guys are amazing          In Store       138405             Leisure   New York / City  Advertisement          New York                Yes