Fastest way to discriminate values in JSON by a nested value for large JSON files

I have a JSON file (the largest file available here, 13.2 GB). Here's an example of the format:

{
  "head_templates": [
    {
      "args": {
        "1": "ast",
        "2": "adverb",
        "head": ""
      },
      "expansion": "más o menos",
      "name": "head"
    },
    {
      "args": {},
      "expansion": "más o menos",
      "name": "ast-adv"
    }
  ],
  "lang": "Asturian",
  "lang_code": "ast",
  "pos": "adv",
  "senses": [
    {
      "categories": [
        "Asturian adverbs",
        "Asturian lemmas",
        "Asturian multiword terms"
      ],
      "glosses": [
        "more or less (approximately)"
      ],
      "raw_glosses": [
        "more or less (approximately)"
      ]
    }
  ],
  "word": "más o menos"
}

{
  "categories": [
    "Spanish adverbs",
    "Spanish lemmas",
    "Spanish multiword terms",
    "Spanish terms with IPA pronunciation"
  ],
  "head_templates": [
    {
      "args": {},
      "expansion": "más o menos",
      "name": "es-adv"
    }
  ],
  "lang": "Spanish",
  "lang_code": "es",
  "pos": "adv",
  "related": [
    {
      "word": "quien más quien menos"
    }
  ],
  "senses": [
    {
      "categories": [
        "Spanish terms with usage examples"
      ],
      "examples": [
        {
          "english": "It's approximately ten dollars.",
          "text": "Es más o menos diez dólares.",
          "type": "example"
        }
      ],
      "glosses": [
        "give or take, more or less, approximately; pretty much"
      ],
      "raw_glosses": [
        "give or take, more or less, approximately; pretty much"
      ],
      "synonyms": [
        {
          "word": "aproximadamente"
        },
        {
          "word": "cerca de"
        }
      ]
    },
    {
      "glosses": [
        "so-so (neither good nor bad)"
      ],
      "raw_glosses": [
        "so-so (neither good nor bad)"
      ]
    }
  ],
  "sounds": [
    {
      "ipa": "/ˌmas o ˈmenos/"
    },
    {
      "ipa": "[ˌmas o ˈme.nos]"
    }
  ],
  "word": "más o menos"
}

The file has a separate entry for each language (in this case two, "Asturian" and "Spanish"). From the "Spanish" entry, I want to capture the "word" and store it in a list ONLY if "Spanish multiword terms" appears in the categories for that entry. What is the most effective way to do this?

CodePudding user response：

Looking at your data, the file isn't a single well-formed JSON document. It looks like one JSON object per line (the page you linked says so as well).

From there you can loop over your file lines and extract the desired values:

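import json

TARGET = "Spanish multiword terms"
word_list = []
with open("D:\\raw-wiktextract-data.json", 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which json.loads would reject
        tmp = json.loads(line)
        # In your sample the category sits on the entry itself (Spanish) or
        # inside a sense (Asturian), so collect both before testing.
        cats = set(tmp.get('categories', []))
        for s in tmp.get('senses', []):
            cats.update(s.get('categories', []))
        if TARGET in cats:
            word_list.append(tmp.get('word', ''))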
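
Since you asked for speed: one possible refinement is to skip the JSON parse for lines that cannot match. This is only a sketch and I haven't benchmarked it on the 13.2 GB file. It assumes one object per line as above and that the category string is stored unescaped in the raw text (it is plain ASCII, so that should hold). The helper name collect_words and its parameters are made up for illustration.

```python
import json

def collect_words(path, category="Spanish multiword terms"):
    """Return the "word" of every entry tagged with `category` (hypothetical helper)."""
    words = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            # Cheap substring test first: lines that never mention the category
            # are skipped without paying for json.loads.
            if category not in line:
                continue
            entry = json.loads(line)
            # Confirm the match on parsed data, so a gloss or example that merely
            # contains the same text does not count. Check entry-level and
            # sense-level categories.
            cats = set(entry.get("categories", []))
            for sense in entry.get("senses", []):
                cats.update(sense.get("categories", []))
            if category in cats:
                words.append(entry.get("word", ""))
    return words
```

You would call it as word_list = collect_words("D:\\raw-wiktextract-data.json"). The substring test only decides which lines are worth parsing; the parsed check still makes the final decision, so the result should match the plain loop.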