Home > Software engineering >  How to properly explode json and map to respective objects within arrays?
How to properly explode json and map to respective objects within arrays?

Time:10-26

I have a dataframe with an array column in the below structure :

valueA value B
First [{"posts":[{"body":"post A"},{"body":"post B"},{"body":"post C"}],"subject":[{"id":"5571196","name":"author A","timestamp":"2021-10-20T08:41:09.124Z"},{"id":"1076004","name":"author B","timestamp":"2021-12-25T07:34:27.524Z"},{"id":"1077532","name":"author C","timestamp":"2022-10-10T20:48:27.416Z"}],"communityName":"Cooking"}]

JSON structure:

[{
    "posts": [{
        "body": "post A"
    }, {
        "body": "post B"
    }, {
        "body": "post C"
    }],
    "subject": [{
        "id": "5571196",
        "name": "author A",
        "timestamp": "2021-10-20T08:41:09.124Z"
    }, {
        "id": "1076004",
        "name": "author B",
        "timestamp": "2021-12-25T07:34:27.524Z"
    }, {
        "id": "1077532",
        "name": "author C",
        "timestamp": "2022-10-10T20:48:27.416Z"
    }],
    "communityName": "Cooking"
}]

DESIRED OUTPUT:

value A postsBody author authorId timestamp communityName
First post A author A 5571196 2021-10-20T08:41:09.124Z Cooking
First post B author B 1076004 2021-12-25T07:34:27.524Z Cooking
First post C author C 1077532 2022-10-10T20:48:27.416Z Cooking

my original approach was to explode the posts array separately then the subject array and join them, but i keep getting

value A postsBody author authorId timestamp communityName
First post A author A 5571196 2021-10-20T08:41:09.124Z Cooking
First post B author A 5571196 2021-10-20T08:41:09.124Z Cooking
First post C author A 5571196 2021-10-20T08:41:09.124Z Cooking
First post A author B 1076004 2021-12-25T07:34:27.524Z Cooking
First post B author B 1076004 2021-12-25T07:34:27.524Z Cooking
First post C author B 1076004 2021-12-25T07:34:27.524Z Cooking
First post A author C 1077532 2022-10-10T20:48:27.416Z Cooking
First post B author C 1077532 2022-10-10T20:48:27.416Z Cooking
First post C author C 1077532 2022-10-10T20:48:27.416Z Cooking

how can i properly map the respective posts and authors and get to the desired output?

CodePudding user response:

If the json format is always consistent as in your example:

  • use the json module to read the string in valueB --> stuff
  • stuff is a list with one item so stuff = stuff[0]
  • the body and posts are associated positionally in the data so zip them then iterate
    for (body,post) in zip(stuff['posts'],stuff['subject']):
    • that produces dictionary pairs
      {'body': 'mediterranean chicken kebab recipe'} {'id': '5571196', 'name': 'kayla obrien', 'timestamp': '2021-10-20T08:41:09.124Z'}
  • extract the info you want with the dictionary keys and construct a list-of-lists or something that can be fed to a DataFrame
  • or maybe just combine those dicts to feed to the DataFrame body.update(post)
    {'body': 'mediterranean chicken kebab recipe', 'id': '5571196', 'name': 'kayla obrien', 'timestamp': '2021-10-20T08:41:09.124Z'}
  • Related