How to read JSON with HTML as a pandas DataFrame

I'm trying to convert this data: https://cv.iptc.org/newscodes/mediatopic/ to a pandas DataFrame.

I'm mainly interested in the "QCode", e.g. Concept ID (QCode) = medtop:01000000, and the corresponding Name (en-GB), e.g. arts, culture, entertainment and media.

My best attempt so far is to download the data as a JSON file. At the top of the page there's a link: "View this scheme in other formats: NewsML G2 Knowledge Item | RDF/XML | RDF/Turtle | JSON-LD".

When I downloaded the json file I had to remove the first part:

"@context": "https://www.iptc.org/std/IKOS/IKOS.jsonld", 
"uri": "http://cv.iptc.org/newscodes/mediatopic/", 
"type": "http://www.w3.org/2004/02/skos/core#ConceptScheme", 
"prefSchemeAlias": "medtop", 
"authority": "http://www.iptc.org", 
"copyrightHolder": "IPTC, International Press Telecommunications Council - https://iptc.org", 
"licenceLink": "http://creativecommons.org/licenses/by/4.0/", 
"dateReleased": "2022-07-07T12:00:00+00:00", 
"prefLabel" : {
"en-GB" : "Media Topic"},
"definition" : {
"en-GB" : "Indicates a subject of an item."},
"note" : {
"en-GB" : "The Media Topic NewsCodes has been IPTC's primary subject taxonomy since 2010, with a focus on classification of text. The development started with our previous Subject Codes taxonomy and extended the tree to 5 levels and reused the same 17 top level terms. The terms below the top level have been revised and rearranged. Most Media Topic concepts provide a mapping back to one of the Subject Codes, and many provide a mapping to Wikidata."},
"hasTopConcept" : [
"http://cv.iptc.org/newscodes/mediatopic/01000000", "http://cv.iptc.org/newscodes/mediatopic/02000000", "http://cv.iptc.org/newscodes/mediatopic/03000000", "http://cv.iptc.org/newscodes/mediatopic/04000000", "http://cv.iptc.org/newscodes/mediatopic/05000000", "http://cv.iptc.org/newscodes/mediatopic/06000000", "http://cv.iptc.org/newscodes/mediatopic/07000000", "http://cv.iptc.org/newscodes/mediatopic/08000000", "http://cv.iptc.org/newscodes/mediatopic/09000000", "http://cv.iptc.org/newscodes/mediatopic/10000000", "http://cv.iptc.org/newscodes/mediatopic/11000000", "http://cv.iptc.org/newscodes/mediatopic/12000000", "http://cv.iptc.org/newscodes/mediatopic/13000000", "http://cv.iptc.org/newscodes/mediatopic/14000000", "http://cv.iptc.org/newscodes/mediatopic/15000000", "http://cv.iptc.org/newscodes/mediatopic/16000000", "http://cv.iptc.org/newscodes/mediatopic/17000000"
],

Then I loaded the JSON file in my Jupyter notebook with:

import pandas as pd

df = pd.read_json("cptall-en-GB.json")

The result has only two columns: the index and one column in which each cell holds all the information for a concept as one long string. The first two entries look like this:

{'conceptSet': {0: {'uri': 'http://cv.iptc.org/newscodes/mediatopic/01000000',
   'qcode': 'medtop:01000000',
   'type': ['http://www.w3.org/2004/02/skos/core#Concept'],
   'inScheme': ['http://cv.iptc.org/newscodes/mediatopic/'],
   'modified': '2021-02-18T12:00:00+00:00',
   'prefLabel': {'en-GB': 'arts, culture, entertainment and media'},
   'definition': {'en-GB': 'All forms of arts, entertainment, cultural heritage and media'},
   'narrower': ['medtop:20000002', 'medtop:20000038', 'medtop:20000045'],
   'exactMatch': ['http://cv.iptc.org/newscodes/subjectcode/01000000'],
   'created': '2009-10-22T02:00:00+00:00'},
  1: {'uri': 'http://cv.iptc.org/newscodes/mediatopic/02000000',
   'qcode': 'medtop:02000000',
   'type': ['http://www.w3.org/2004/02/skos/core#Concept'],
   'inScheme': ['http://cv.iptc.org/newscodes/mediatopic/'],
   'modified': '2021-05-05T12:00:00+00:00',
   'prefLabel': {'en-GB': 'crime, law and justice'},
   'definition': {'en-GB': 'The establishment and/or statement of the rules of behaviour in society, the enforcement of these rules, breaches of the rules, the punishment of offenders and the organisations and bodies involved in these activities'},
   'narrower': ['medtop:20000082',
    'medtop:20000106',
    'medtop:20000119',
    'medtop:20000121',
    'medtop:20000129'],
   'exactMatch': ['http://cv.iptc.org/newscodes/subjectcode/02000000',
    'https://www.wikidata.org/entity/Q146491'],
   'created': '2009-10-22T02:00:00+00:00'}}}

Any advice on how to turn this into a better-looking DataFrame with all fields as columns?

CodePudding user response:

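You can remove the

"conceptSet":

wrapper, that is, the key at the start of the file together with the outer braces around it, so that the file contains only the top-level list of concepts (see the sample from my JSON file at the end of this answer). After that, try reading the file again. I downloaded and imported the dataset this way, and it works fine for me.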
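If you would rather not edit the file by hand, the same result can probably be reached in code. This is only a minimal sketch, assuming the downloaded file is a single JSON object whose "conceptSet" key holds the list of concepts, as the output in the question suggests. The inline `sample` string is a hypothetical stand-in for the file contents, trimmed to a few fields.

```python
import json

import pandas as pd

# Hypothetical stand-in for the downloaded file: a few header keys plus
# "conceptSet", which holds the list of concepts (assumed structure).
sample = """
{
  "prefSchemeAlias": "medtop",
  "conceptSet": [
    {"uri": "http://cv.iptc.org/newscodes/mediatopic/01000000",
     "qcode": "medtop:01000000",
     "prefLabel": {"en-GB": "arts, culture, entertainment and media"},
     "narrower": ["medtop:20000002", "medtop:20000038", "medtop:20000045"]},
    {"uri": "http://cv.iptc.org/newscodes/mediatopic/02000000",
     "qcode": "medtop:02000000",
     "prefLabel": {"en-GB": "crime, law and justice"},
     "narrower": ["medtop:20000082", "medtop:20000106"]}
  ]
}
"""

data = json.loads(sample)
# For the real file, something like this should work instead:
# with open("cptall-en-GB.json", encoding="utf-8") as f:
#     data = json.load(f)

# Build the frame from the list under "conceptSet" only;
# the header keys next to it are simply ignored.
df = pd.DataFrame(data["conceptSet"])
print(df.columns.tolist())
```

Each key of a concept becomes a column, so the file itself stays untouched and the header does not need to be removed first.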

My Code:

import pandas as pd

df = pd.read_json(r"C:\Users\hp\Downloads\cptall-en-US.json")
print(df)

My Output:

                                                   uri  ... hasFacet
0     http://cv.iptc.org/newscodes/mediatopic/01000000  ...      NaN
1     http://cv.iptc.org/newscodes/mediatopic/02000000  ...      NaN
2     http://cv.iptc.org/newscodes/mediatopic/03000000  ...      NaN
3     http://cv.iptc.org/newscodes/mediatopic/04000000  ...      NaN
4     http://cv.iptc.org/newscodes/mediatopic/05000000  ...      NaN
...                                                ...  ...      ...
1351  http://cv.iptc.org/newscodes/mediatopic/20001355  ...      NaN
1352  http://cv.iptc.org/newscodes/mediatopic/20001356  ...      NaN
1353  http://cv.iptc.org/newscodes/mediatopic/20001357  ...      NaN
1354  http://cv.iptc.org/newscodes/mediatopic/20001358  ...      NaN
1355  http://cv.iptc.org/newscodes/mediatopic/20001359  ...      NaN

[1356 rows x 15 columns]

Some data from my JSON file:

[
    {
        "uri": "http://cv.iptc.org/newscodes/mediatopic/01000000",
        "qcode": "medtop:01000000",
        "type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
        ],
        "inScheme": [
            "http://cv.iptc.org/newscodes/mediatopic/"
        ],
        "modified": "2021-02-18T12:00:00+00:00",
        "prefLabel": {
            "en-US": "arts, culture, entertainment and media"
        },
        "definition": {
            "en-US": "All forms of arts, entertainment, cultural heritage and media"
        },
        "narrower": [
            "medtop:20000002",
            "medtop:20000038",
            "medtop:20000045"
        ],
        "exactMatch": [
            "http://cv.iptc.org/newscodes/subjectcode/01000000"
        ],
        "created": "2009-10-22T02:00:00+00:00"
    },
    {
        "uri": "http://cv.iptc.org/newscodes/mediatopic/02000000",
        "qcode": "medtop:02000000",
        "type": [
            "http://www.w3.org/2004/02/skos/core#Concept"
        ],
        "inScheme": [
            "http://cv.iptc.org/newscodes/mediatopic/"
        ],
        "modified": "2021-05-05T12:00:00+00:00",
        "prefLabel": {
            "en-US": "crime, law and justice"
        },
        "definition": {
            "en-US": "The establishment and/or statement of the rules of behavior in society, the enforcement of these rules, breaches of the rules, the punishment of offenders and the organizations and bodies involved in these activities"
        },
        "narrower": [
            "medtop:20000082",
            "medtop:20000106",
            "medtop:20000119",
            "medtop:20000121",
            "medtop:20000129"
        ],
        "exactMatch": [
            "http://cv.iptc.org/newscodes/subjectcode/02000000",
            "https://www.wikidata.org/entity/Q146491"
        ],
        "created": "2009-10-22T02:00:00+00:00"
    }]
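
Since the question mainly asks for the QCode and the name, the nested "prefLabel" field still has to be unpacked after loading. One possible way is `pd.json_normalize`, which flattens nested dicts into dotted column names. This is a sketch on two hand-copied records, not the full dataset; the `concepts` list and the `name` column are illustrative names, and the language key is "en-GB" here but would be "en-US" for the file used in this answer.

```python
import pandas as pd

# Two records copied from the sample above, trimmed to the relevant fields.
concepts = [
    {"qcode": "medtop:01000000",
     "prefLabel": {"en-GB": "arts, culture, entertainment and media"},
     "narrower": ["medtop:20000002", "medtop:20000038", "medtop:20000045"]},
    {"qcode": "medtop:02000000",
     "prefLabel": {"en-GB": "crime, law and justice"},
     "narrower": ["medtop:20000082", "medtop:20000106"]},
]

# Nested dicts become dotted columns such as "prefLabel.en-GB";
# list-valued fields such as "narrower" are left as lists.
flat = pd.json_normalize(concepts)

# Keep only the two fields of interest and give the label a shorter name.
codes = flat[["qcode", "prefLabel.en-GB"]].rename(
    columns={"prefLabel.en-GB": "name"}
)
print(codes)
```

With the full file, the same call should work on the whole list of concepts; records that lack a field simply get NaN in that column.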