I have a dataframe with an array column in the below structure :
valueA | value B |
---|---|
First | [{"posts":[{"body":"post A"},{"body":"post B"},{"body":"post C"}],"subject":[{"id":"5571196","name":"author A","timestamp":"2021-10-20T08:41:09.124Z"},{"id":"1076004","name":"author B","timestamp":"2021-12-25T07:34:27.524Z"},{"id":"1077532","name":"author C","timestamp":"2022-10-10T20:48:27.416Z"}],"communityName":"Cooking"}] |
JSON structure:
[{
"posts": [{
"body": "post A"
}, {
"body": "post B"
}, {
"body": "post C"
}],
"subject": [{
"id": "5571196",
"name": "author A",
"timestamp": "2021-10-20T08:41:09.124Z"
}, {
"id": "1076004",
"name": "author B",
"timestamp": "2021-12-25T07:34:27.524Z"
}, {
"id": "1077532",
"name": "author C",
"timestamp": "2022-10-10T20:48:27.416Z"
}],
"communityName": "Cooking"
}]
DESIRED OUTPUT:
value A | postsBody | author | authorId | timestamp | communityName |
---|---|---|---|---|---|
First | post A | author A | 5571196 | 2021-10-20T08:41:09.124Z | Cooking |
First | post B | author B | 1076004 | 2021-12-25T07:34:27.524Z | Cooking |
First | post C | author C | 1077532 | 2022-10-10T20:48:27.416Z | Cooking |
my original approach was to explode the posts array separately then the subject array and join them, but i keep getting
value A | postsBody | author | authorId | timestamp | communityName |
---|---|---|---|---|---|
First | post A | author A | 5571196 | 2021-10-20T08:41:09.124Z | Cooking |
First | post B | author A | 5571196 | 2021-10-20T08:41:09.124Z | Cooking |
First | post C | author A | 5571196 | 2021-10-20T08:41:09.124Z | Cooking |
First | post A | author B | 1076004 | 2021-12-25T07:34:27.524Z | Cooking |
First | post B | author B | 1076004 | 2021-12-25T07:34:27.524Z | Cooking |
First | post C | author B | 1076004 | 2021-12-25T07:34:27.524Z | Cooking |
First | post A | author C | 1077532 | 2022-10-10T20:48:27.416Z | Cooking |
First | post B | author C | 1077532 | 2022-10-10T20:48:27.416Z | Cooking |
First | post C | author C | 1077532 | 2022-10-10T20:48:27.416Z | Cooking |
how can i properly map the respective posts and authors and get to the desired output?
CodePudding user response:
If the json format is always consistent as in your example:
- use the json module to read the string in
valueB
-->stuff
stuff
is a list with one item sostuff = stuff[0]
- the body and posts are associated positionally in the data so zip them then iterate
for (body,post) in zip(stuff['posts'],stuff['subject']):
- that produces dictionary pairs
{'body': 'mediterranean chicken kebab recipe'} {'id': '5571196', 'name': 'kayla obrien', 'timestamp': '2021-10-20T08:41:09.124Z'}
- that produces dictionary pairs
- extract the info you want with the dictionary keys and construct a list-of-lists or something that can be fed to a DataFrame
- or maybe just combine those dicts to feed to the DataFrame
body.update(post)
{'body': 'mediterranean chicken kebab recipe', 'id': '5571196', 'name': 'kayla obrien', 'timestamp': '2021-10-20T08:41:09.124Z'}