I'm using an API to retrieve a list of nested dictionaries containing data from an online survey. The data looks as follows:
filtered_messages = [{'type': 'message',
'subtype': 'bot_message',
'text': "This content can't be displayed.",
'ts': '1654800705.715459',
'username': 'Typeform',
'icons': {'image_48': 'www.example.com'},
'bot_id': 'ABC4K1SJE',
'app_id': 'DEFSC3RT6',
'blocks': [{'type': 'section',
'block_id': 'GUoS',
'text': {'type': 'mrkdwn',
'text': 'You have a new message.',
'verbatim': False}},
{'type': 'section',
'block_id': 'abZ',
'text': {'type': 'mrkdwn',
'text': '*Where did you first hear about us?*\nAdvertisement',
'verbatim': False}},
{'type': 'section',
'block_id': 'lEd',
'text': {'type': 'mrkdwn',
'text': '*Reason for purchase: *\nLeisure',
'verbatim': False}},
{'type': 'section',
'block_id': '4=vEY',
'text': {'type': 'mrkdwn',
'text': '*Where do you reside? *\nNew York',
'verbatim': False}},
{'type': 'section',
'block_id': 'P1SKA',
'text': {'type': 'mrkdwn',
'text': '*Are you purchasing online or in store? *\nIn Store',
'verbatim': False}},
{'type': 'section',
'block_id': 'f74v',
'text': {'type': 'mrkdwn',
'text': '*What was the location of your purchase? *\nNew York / City',
'verbatim': False}},
{'type': 'section',
'block_id': '3tr6',
'text': {'type': 'mrkdwn',
'text': '*Would you be open to being interviewed by our team? *\nYes',
'verbatim': False}},
{'type': 'section',
'block_id': 'aGz',
'text': {'type': 'mrkdwn',
'text': '*Any other feedback/questions for us? *\nYou guys are amazing',
'verbatim': False}},
{'type': 'section',
'block_id': '3A4Y',
'text': {'type': 'mrkdwn',
'text': '*Order Number:*\n138405',
'verbatim': False}},
{'type': 'actions',
'block_id': 'KNSXk',
'elements': [{'type': 'button',
'action_id': '4nR',
'text': {'type': 'plain_text', 'text': 'View results', 'emoji': True},
'url': 'www.example.com'}]},
{'type': 'section',
'block_id': 'kSjX4',
'text': {'type': 'mrkdwn', 'text': ' ', 'verbatim': False}}]},
{'type': 'message',
'subtype': 'bot_message',
'text': "This content can't be displayed.",
'ts': '1654457958.167459',
'username': 'Typeform',
'icons': {'image_48': 'www.example.com'},
'bot_id': 'BPD4K3SJW',
'app_id': 'AD6SC3RT6',
'blocks': [{'type': 'section',
'block_id': 'Y8w',
'text': {'type': 'mrkdwn',
'text': 'You have a new message.',
'verbatim': False}},
{'type': 'section',
'block_id': '9DGN',
'text': {'type': 'mrkdwn',
'text': '*Where did you first hear about us?*\nFriend',
'verbatim': False}},
{'type': 'section',
'block_id': '=JF',
'text': {'type': 'mrkdwn',
'text': '*Reason for purchase: *\nBusiness',
'verbatim': False}},
{'type': 'section',
'block_id': '3FY/t',
'text': {'type': 'mrkdwn',
'text': '*Where do you reside? *\nScotland',
'verbatim': False}},
{'type': 'section',
'block_id': '8yz',
'text': {'type': 'mrkdwn',
'text': '*Are you purchasing online or in store? *\nOnline',
'verbatim': False}},
{'type': 'section',
'block_id': 'PSm',
'text': {'type': 'mrkdwn',
'text': '*What was the location of your purchase? *\nLondon',
'verbatim': False}},
{'type': 'section',
'block_id': 'fytG',
'text': {'type': 'mrkdwn',
'text': '*Would you be open to being interviewed by our team? *\nNo',
'verbatim': False}},
{'type': 'section',
'block_id': 'S8LaL',
'text': {'type': 'mrkdwn',
'text': '*Order Number:*\n140829',
'verbatim': False}},
{'type': 'actions',
'block_id': 'ZON',
'elements': [{'type': 'button',
'action_id': '02eV',
'text': {'type': 'plain_text', 'text': 'View results', 'emoji': True},
'url': 'www.example.com'}]},
{'type': 'section',
'block_id': 'zeVIW',
'text': {'type': 'mrkdwn', 'text': ' ', 'verbatim': False}}]}]
I would like to create a Pandas dataframe that looks as follows:
ts heard_about_us reason_for_purchase customer_location purchase_modality purchase_location agree_to_interview comments order_number
1654800705.715459 Advertisement Leisure New York In Store New York / City Yes You guys are amazing 138405
1654457958.167459 Friend Business Scotland Online London No 140829
The column names align with the questions inside the text
element (enclosed in asterisks '*').
To do so, I'm using the following code:
survey_data = pd.DataFrame(data=[[re.sub(".*[*]\\W ", "", val['text']['text']) for val in dat['blocks'] if val.get('text')][1:9] for dat in filtered_messages],
columns=['heard_about_us', 'reason_for_purchase', 'customer_location', 'purchase_modality', 'purchase_location', 'agree_to_interview', 'comments', 'order_number']
)
The result is the following:
ts heard_about_us reason_for_purchase customer_location purchase_modality purchase_location agree_to_interview comments order_number
1654800705.715459 Advertisement Leisure New York In Store New York / City Yes You guys are amazing 138405
1654457958.167459 Friend Business Scotland Online London No 140829
Notice that the order ID 140829
is shifted to the left because there is no comments
field in that survey record (because some of the survey questions are not mandatory).
To overcome this, I have tried to use pd.json.normalize()
to create the df
:
df = pd.json_normalize(filtered_messages,
record_path = ['blocks'],
meta = ['type', 'text'],
errors = 'ignore',
record_prefix = '_'
)
But, this results in a df
with a _text.text
column containing rows representing each of the elements that I'd like to see expressed as columns. Here is a small sample of the output:
type _block_id _text.type _text.text _text.verbatim _elements type text
0 section GUoS mrkdwn You have a new message. False NaN message This content can't be displayed.
1 section abZ mrkdwn *Where did you first hear about us?*\nAdvertis... False NaN message This content can't be displayed.
2 section lEd mrkdwn *Reason for purchase: *\nLeisure False NaN message This content can't be displayed.
3 section 4=vEY mrkdwn *Where do you reside? *\nNew York False NaN message This content can't be displayed.
4 section P1SKA mrkdwn *Are you purchasing online or in store? *\nIn ... False NaN message This content can't be displayed.
5 section f74v mrkdwn *What was the location of your purchase? *\nNe... False NaN message This content can't be displayed.
6 section 3tr6 mrkdwn *Would you be open to being interviewed by our... False NaN message This content can't be displayed.
7 section aGz mrkdwn *Any other feedback/questions for us? *\nYou g... False NaN message This content can't be displayed.
8 section 3A4Y mrkdwn *Order Number:*\n138405 False NaN message This content can't be displayed.
9 actions KNSXk NaN NaN NaN [{'type': 'button', 'action_id': '4nR', 'text'... message This content can't be displayed
How would I access the required elements inside of blocks
to create the df
? Potentially using the questions enclosed inside the asterisks (*) as the keys of a dictionary? Thanks!
CodePudding user response:
You can use:
# Mapping
mapping = {
'Where did you first hear about us?': 'heard_about_us',
'Reason for purchase:': 'reason_for_purchase',
'Where do you reside?': 'customer_location',
'Are you purchasing online or in store?': 'purchase_modality',
'What was the location of your purchase?': 'purchase_location',
'Would you be open to being interviewed by our team?': 'agree_to_interview',
'Any other feedback/questions for us?': 'comments',
'Order Number:': 'order_number'
}
# Use your options here but don't forget 'ts' meta
df = pd.json_normalize(filtered_messages, record_path=['blocks'],
meta=['ts', 'type', 'text'], record_prefix='_')
# Extract question/answer
df1 = df['_text.text'].str.extract(r'\*(?P<question>[^\*] )\*\s (?P<answer>.*)')
# Create final dataframe
out = pd.concat([df['ts'], df1], axis=1).dropna()
out['question'] = out['question'].str.strip().replace(mapping)
# Reshape it
out = (out.pivot('ts', 'question', 'answer')[mapping.values()]
.reset_index().rename_axis(columns=None))
Output:
>>> out
ts heard_about_us reason_for_purchase customer_location purchase_modality purchase_location agree_to_interview comments order_number
0 1654457958.167459 Friend Business Scotland Online London No NaN 140829
1 1654800705.715459 Advertisement Leisure New York In Store New York / City Yes You guys are amazing 138405
CodePudding user response:
df = pd.json_normalize(filtered_messages, ['blocks'], 'ts')[['ts', 'text.text']]
df.columns = ['ts', 'text']
df.text.replace(' ', np.nan, inplace=True)
df.text.replace('You have a new message.', np.nan, inplace=True)
df.dropna(inplace=True)
df[['question', 'answer']] = df.apply(lambda x: x.text.strip('*').split('*'), axis=1, result_type='expand')
df = df.pivot(index='ts', columns='question', values='answer')
df.columns = ['comments', 'purchase_modality', 'order_number', 'reason_for_purchase', 'purchase_location', 'heard_about_us', 'customer_location', 'agree_to_interview']
df = df.applymap(lambda x: x.strip().strip('\\'), na_action='ignore')
df.reset_index(inplace=True)
Output:
ts comments purchase_modality order_number reason_for_purchase purchase_location heard_about_us customer_location agree_to_interview
0 1654457958.167459 NaN Online 140829 Business London Friend Scotland No
1 1654800705.715459 You guys are amazing In Store 138405 Leisure New York / City Advertisement New York Yes