Extract a subset of deeply nested JSON and print only the key/value pairs I am interested in

Time:05-02

I have a deeply nested JSON file. I want to extract and parse only the subset I am interested in, in my case all content under the 'node' key. How can I:

  1. extract the subset of this JSON file that contains "edges[].node" (edges is the parent key of node)

  2. in the 'node' section, I am interested in the key:value pairs of

    .url,
    .headline.default, (*this one is a 'grandchild' of the key 'node'*)
    .firstPublished
    

    I want to keep only the above 3 items inside the 'node' key. How can I print out the super-slim version of the JSON file I need?

  3. a nice-to-have option: can I still keep the structure/full path that leads from the JSON root key to the embedded 'node' subset I am interested in?

Here is the jqplay snippet with my JSON (full content of my JSON file).

I will also attach the full content here:

{
  "data": {
    "legacyCollection": {
      "longDescription": "The latest news, analysis and investigations from Europe.",
      "section": {
        "name": "world",
        "url": "/section/world"
      },
      "collectionsPage": {
        "stream": {
          "pageInfo": {
            "hasNextPage": true,
            "__typename": "PageInfo"
          },
          "__typename": "AssetsConnection",
          "edges": [
            {
              "node": {
                "url": "https://www.nytimes.com/video/world/europe/100000008323381/icc-war-crimes-ukraine.html",
                "firstPublished": "2022-04-27T23:28:33.241Z",
                "headline": {
                  "default": "I.C.C. Joins Investigation of War Crimes in Ukraine",
                  "__typename": "CreativeWorkHeadline"
                },
                "summary": "Karim Khan, the chief prosecutor of the International Criminal Court, said that his organization would participate in a joint effort — with Ukraine, Poland and Lithuania — to investigate war crimes committed since Russia’s invasion.",
                "promotionalMedia": {
                  "__typename": "Image",
                  "id": "SW1hZ2U6bnl0Oi8vaW1hZ2UvYTY3MTVhNDUtZDE0NS01OWZjLThkZWItNzYxMWViN2UyODhk"
                },
                "embedded": false
              },
              "__typename": "AssetsEdge"
            },
            {
              "node": {
                "__typename": "Article",
                "url": "https://www.nytimes.com/2022/04/27/sports/soccer/chelsea-sale-roman-abramovich.html",
                "firstPublished": "2022-04-27T19:42:17.000Z",
                "typeOfMaterials": [
                  "News"
                ],
                "archiveProperties": {
                  "lede": "",
                  "__typename": "ArticleArchiveProperties"
                },
                "headline": {
                  "default": "Endgame Nears in Bidding for Chelsea F.C.",
                  "__typename": "CreativeWorkHeadline"
                },
                "summary": "The American bank selling the English soccer team on behalf of its Russian owner could name its preferred suitor by the end of the week. But the drama isn’t over.",
                "translations": []
              },
              "__typename": "AssetsEdge"
            }
          ],
          "totalCount": 52559
        }
      },
      "sourceId": "100000004047788",
      "tagline": "",
      "__typename": "LegacyCollection"
    }
  }
}

Here is the command I have (jqplay demo):

.data.legacyCollection.collectionsPage.stream.edges[].node |= with_entries(select([.key]|inside(["default","url","firstPublished"])))

And here is the output I got:

{
  "data": {
    "legacyCollection": {
      "longDescription": "The latest news, analysis and investigations from Europe.",
      "section": {
        "name": "world",
        "url": "/section/world"
      },
      "collectionsPage": {
        "stream": {
          "pageInfo": {
            "hasNextPage": true,
            "__typename": "PageInfo"
          },
          "__typename": "AssetsConnection",
          "edges": [
            {
              "node": {
                "url": "https://www.nytimes.com/video/world/europe/100000008323381/icc-war-crimes-ukraine.html",
                "firstPublished": "2022-04-27T23:28:33.241Z"
              },
              "__typename": "AssetsEdge"
            },
            {
              "node": {
                "url": "https://www.nytimes.com/2022/04/27/sports/soccer/chelsea-sale-roman-abramovich.html",
                "firstPublished": "2022-04-27T19:42:17.000Z"
              },
              "__typename": "AssetsEdge"
            }
          ],
          "totalCount": 52559
        }
      },
      "sourceId": "100000004047788",
      "tagline": "",
      "__typename": "LegacyCollection"
    }
  }
}

Here is the output I expect to have:

{
  "data": {
    "legacyCollection": {
      "collectionsPage": {
        "stream": {
          "edges": [
            {
              "node": {
                "url": "https://www.nytimes.com/video/world/europe/100000008323381/icc-war-crimes-ukraine.html",
                "firstPublished": "2022-04-27T23:28:33.241Z"
              }
            },
            {
              "node": {
                "url": "https://www.nytimes.com/2022/04/27/sports/soccer/chelsea-sale-roman-abramovich.html",
                "firstPublished": "2022-04-27T19:42:17.000Z"
              }
            }
          ]
        }
      }
    }
  }
}

CodePudding user response:

Here's a (somewhat) declarative solution:

.data.legacyCollection.collectionsPage.stream.edges as $edges
| {data: {
     legacyCollection: {
       collectionsPage: {
         stream: {
           edges: ($edges | map( 
             {node: (.node|{url,
                            firstPublished,
                            headline: {default: .headline.default} })}))
         }
       }
     }
   }
  }
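
If pruning the ancestors is not needed (items 1 and 2 of the question only), the update-assignment from the question can probably be kept with a small change. `with_entries` only sees the top-level keys of `node`, so "default" never matches there; rebuilding each node with an object construction avoids that. A minimal sketch, wrapped in a shell snippet so it can be run as-is. The sample input is a made-up, trimmed-down stand-in for the real file, and `jq` 1.5 or later is assumed to be on the PATH:

```shell
# Hypothetical sample in the same shape as the question's input (not the real data).
input='{"data":{"legacyCollection":{"tagline":"","collectionsPage":{"stream":{"totalCount":1,"edges":[{"node":{"url":"u1","firstPublished":"2022-04-27","headline":{"default":"h1","__typename":"T"},"summary":"s1"},"__typename":"AssetsEdge"}]}}}}}'

# Rebuild every node from the three wanted fields.
# Everything outside "node" (tagline, totalCount, __typename, ...) is left untouched.
result=$(printf '%s' "$input" | jq -c '
  .data.legacyCollection.collectionsPage.stream.edges[].node
    |= {url, firstPublished, headline: {default: .headline.default}}
')

printf '%s\n' "$result"
```

This slims down the nodes but keeps the siblings along the path, so it is a complement to the filter above rather than a replacement for it.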

CodePudding user response:

Here's one way to make the selection while ensuring that the structure is preserved. This solution may be of interest because it can easily be adapted for use with jq's "--stream" option.

def array_startswith($head): .[: $head|length] == $head;

. as $in
| ["data", "legacyCollection", "collectionsPage", "stream", "edges"] as $head
| ($head|length) as $len
| reduce (paths
          | select( array_startswith($head) and .[1 + $len] == "node" )) as $p
    (null;
     if ((($p|length) == $len + 3) and ($p[-1] | IN("url", "firstPublished")))
        or ((($p|length) == $len + 4) and $p[-2:] == ["headline", "default"])
     then setpath($p; $in | getpath($p))
     else .
     end)
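
As for the "--stream" adaptation mentioned above, here is one way it might look; treat it as a sketch rather than a tuned solution. With `--stream`, jq emits `[path, leaf]` events (plus one-element closing events), so `paths` and `getpath` are replaced by the events read via `inputs`. The sample input is a hypothetical, trimmed-down stand-in for the real file, and a jq version that provides `IN` is assumed (1.6 or later, as far as I know):

```shell
# Hypothetical sample in the same shape as the question's input (not the real data).
input='{"data":{"legacyCollection":{"section":{"url":"/section/world"},"collectionsPage":{"stream":{"edges":[{"node":{"url":"u1","firstPublished":"d1","headline":{"default":"h1","__typename":"T"},"summary":"s1"},"__typename":"AssetsEdge"},{"node":{"url":"u2","firstPublished":"d2","headline":{"default":"h2"},"translations":[]}}],"totalCount":2}}}}}'

# -n plus `inputs` lets the reduce consume the stream events one by one.
result=$(printf '%s' "$input" | jq -c -n --stream '
  ["data", "legacyCollection", "collectionsPage", "stream", "edges"] as $head
  | ($head|length) as $len
  | reduce (inputs
            | select(length == 2)                      # keep [path, leaf] events only
            | select(.[0][:$len] == $head and .[0][1 + $len] == "node")) as [$p, $v]
      (null;
       if ((($p|length) == $len + 3) and ($p[-1] | IN("url", "firstPublished")))
          or ((($p|length) == $len + 4) and $p[-2:] == ["headline", "default"])
       then setpath($p; $v)                            # the leaf value comes with the event
       else .
       end)
')

printf '%s\n' "$result"
```

The slimmed-down result is still accumulated in memory by `reduce`, but the full input no longer has to be loaded as a single value.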