I'm trying to convert this data: https://cv.iptc.org/newscodes/mediatopic/ to a pandas DataFrame.
I'm mainly interested in the "QCode", e.g. Concept ID (QCode) = medtop:01000000, and the corresponding Name (en-GB), e.g. arts, culture, entertainment and media.
My best attempt so far is to download the data as a JSON file. At the top of the site there's a link: "View this scheme in other formats: NewsML G2 Knowledge Item | RDF/XML | RDF/Turtle | JSON-LD".
After downloading the JSON file, I had to remove the first part:
"@context": "https://www.iptc.org/std/IKOS/IKOS.jsonld",
"uri": "http://cv.iptc.org/newscodes/mediatopic/",
"type": "http://www.w3.org/2004/02/skos/core#ConceptScheme",
"prefSchemeAlias": "medtop",
"authority": "http://www.iptc.org",
"copyrightHolder": "IPTC, International Press Telecommunications Council - https://iptc.org",
"licenceLink": "http://creativecommons.org/licenses/by/4.0/",
"dateReleased": "2022-07-07T12:00:00 00:00",
"prefLabel" : {
"en-GB" : "Media Topic"},
"definition" : {
"en-GB" : "Indicates a subject of an item."},
"note" : {
"en-GB" : "The Media Topic NewsCodes has been IPTC's primary subject taxonomy since 2010, with a focus on classification of text. The development started with our previous Subject Codes taxonomy and extended the tree to 5 levels and reused the same 17 top level terms. The terms below the top level have been revised and rearranged. Most Media Topic concepts provide a mapping back to one of the Subject Codes, and many provide a mapping to Wikidata."},
"hasTopConcept" : [
"http://cv.iptc.org/newscodes/mediatopic/01000000", "http://cv.iptc.org/newscodes/mediatopic/02000000", "http://cv.iptc.org/newscodes/mediatopic/03000000", "http://cv.iptc.org/newscodes/mediatopic/04000000", "http://cv.iptc.org/newscodes/mediatopic/05000000", "http://cv.iptc.org/newscodes/mediatopic/06000000", "http://cv.iptc.org/newscodes/mediatopic/07000000", "http://cv.iptc.org/newscodes/mediatopic/08000000", "http://cv.iptc.org/newscodes/mediatopic/09000000", "http://cv.iptc.org/newscodes/mediatopic/10000000", "http://cv.iptc.org/newscodes/mediatopic/11000000", "http://cv.iptc.org/newscodes/mediatopic/12000000", "http://cv.iptc.org/newscodes/mediatopic/13000000", "http://cv.iptc.org/newscodes/mediatopic/14000000", "http://cv.iptc.org/newscodes/mediatopic/15000000", "http://cv.iptc.org/newscodes/mediatopic/16000000", "http://cv.iptc.org/newscodes/mediatopic/17000000"
],
Then I loaded the JSON file in my Jupyter notebook with:
import pandas as pd
df = pd.read_json("cptall-en-GB.json")
This gives me only an index and a single column, conceptSet, where each row holds one whole concept record as a single nested value. The first two entries look like this:
{'conceptSet': {0: {'uri': 'http://cv.iptc.org/newscodes/mediatopic/01000000',
'qcode': 'medtop:01000000',
'type': ['http://www.w3.org/2004/02/skos/core#Concept'],
'inScheme': ['http://cv.iptc.org/newscodes/mediatopic/'],
'modified': '2021-02-18T12:00:00 00:00',
'prefLabel': {'en-GB': 'arts, culture, entertainment and media'},
'definition': {'en-GB': 'All forms of arts, entertainment, cultural heritage and media'},
'narrower': ['medtop:20000002', 'medtop:20000038', 'medtop:20000045'],
'exactMatch': ['http://cv.iptc.org/newscodes/subjectcode/01000000'],
'created': '2009-10-22T02:00:00 00:00'},
1: {'uri': 'http://cv.iptc.org/newscodes/mediatopic/02000000',
'qcode': 'medtop:02000000',
'type': ['http://www.w3.org/2004/02/skos/core#Concept'],
'inScheme': ['http://cv.iptc.org/newscodes/mediatopic/'],
'modified': '2021-05-05T12:00:00 00:00',
'prefLabel': {'en-GB': 'crime, law and justice'},
'definition': {'en-GB': 'The establishment and/or statement of the rules of behaviour in society, the enforcement of these rules, breaches of the rules, the punishment of offenders and the organisations and bodies involved in these activities'},
'narrower': ['medtop:20000082',
'medtop:20000106',
'medtop:20000119',
'medtop:20000121',
'medtop:20000129'],
'exactMatch': ['http://cv.iptc.org/newscodes/subjectcode/02000000',
'https://www.wikidata.org/entity/Q146491'],
'created': '2009-10-22T02:00:00 00:00'}}}
Any advice on how to turn this into a better-looking DataFrame with each field as its own column?
CodePudding user response:
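You can remove the wrapper
{ "conceptSet":
from the start and the matching closing
}
from the end, so that the file contains only the top-level list of concepts. After that, try reading it again. I downloaded the dataset and imported it this way, and it works fine for me.
My code: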
import pandas as pd
df = pd.read_json(r"C:\Users\hp\Downloads\cptall-en-US.json")
print(df)
My output:
                                                   uri  ...  hasFacet
0     http://cv.iptc.org/newscodes/mediatopic/01000000  ...       NaN
1     http://cv.iptc.org/newscodes/mediatopic/02000000  ...       NaN
2     http://cv.iptc.org/newscodes/mediatopic/03000000  ...       NaN
3     http://cv.iptc.org/newscodes/mediatopic/04000000  ...       NaN
4     http://cv.iptc.org/newscodes/mediatopic/05000000  ...       NaN
...                                                ...  ...       ...
1351  http://cv.iptc.org/newscodes/mediatopic/20001355  ...       NaN
1352  http://cv.iptc.org/newscodes/mediatopic/20001356  ...       NaN
1353  http://cv.iptc.org/newscodes/mediatopic/20001357  ...       NaN
1354  http://cv.iptc.org/newscodes/mediatopic/20001358  ...       NaN
1355  http://cv.iptc.org/newscodes/mediatopic/20001359  ...       NaN

[1356 rows x 15 columns]
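If you would rather not edit the downloaded file by hand, a possible alternative is to parse it with the json module and pass only the "conceptSet" list to pandas. This is a minimal sketch, assuming the file is a single JSON object with the scheme metadata at the top and a "conceptSet" key holding a list of concept records. The inline `raw` string is a hypothetical, shortened stand-in for the real file.

```python
import json

import pandas as pd

# Hypothetical, shortened stand-in for the downloaded file. The assumed shape is:
# scheme metadata first, then a "conceptSet" list of concept records.
raw = """
{
  "@context": "https://www.iptc.org/std/IKOS/IKOS.jsonld",
  "prefSchemeAlias": "medtop",
  "conceptSet": [
    {"uri": "http://cv.iptc.org/newscodes/mediatopic/01000000",
     "qcode": "medtop:01000000",
     "prefLabel": {"en-GB": "arts, culture, entertainment and media"}},
    {"uri": "http://cv.iptc.org/newscodes/mediatopic/02000000",
     "qcode": "medtop:02000000",
     "prefLabel": {"en-GB": "crime, law and justice"}}
  ]
}
"""

# For the real file, open it and call json.load() on the file object instead.
data = json.loads(raw)

# Only the list of concepts becomes the DataFrame; the metadata keys are ignored.
concepts = pd.DataFrame(data["conceptSet"])
print(concepts[["qcode", "prefLabel"]])
```

Each key of a concept record becomes a column, and the file on disk stays untouched. Nested values such as prefLabel are still dicts at this point.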
Some data from my JSON file:
[
{
"uri": "http://cv.iptc.org/newscodes/mediatopic/01000000",
"qcode": "medtop:01000000",
"type": [
"http://www.w3.org/2004/02/skos/core#Concept"
],
"inScheme": [
"http://cv.iptc.org/newscodes/mediatopic/"
],
"modified": "2021-02-18T12:00:00 00:00",
"prefLabel": {
"en-US": "arts, culture, entertainment and media"
},
"definition": {
"en-US": "All forms of arts, entertainment, cultural heritage and media"
},
"narrower": [
"medtop:20000002",
"medtop:20000038",
"medtop:20000045"
],
"exactMatch": [
"http://cv.iptc.org/newscodes/subjectcode/01000000"
],
"created": "2009-10-22T02:00:00 00:00"
},
{
"uri": "http://cv.iptc.org/newscodes/mediatopic/02000000",
"qcode": "medtop:02000000",
"type": [
"http://www.w3.org/2004/02/skos/core#Concept"
],
"inScheme": [
"http://cv.iptc.org/newscodes/mediatopic/"
],
"modified": "2021-05-05T12:00:00 00:00",
"prefLabel": {
"en-US": "crime, law and justice"
},
"definition": {
"en-US": "The establishment and/or statement of the rules of behavior in society, the enforcement of these rules, breaches of the rules, the punishment of offenders and the organizations and bodies involved in these activities"
},
"narrower": [
"medtop:20000082",
"medtop:20000106",
"medtop:20000119",
"medtop:20000121",
"medtop:20000129"
],
"exactMatch": [
"http://cv.iptc.org/newscodes/subjectcode/02000000",
"https://www.wikidata.org/entity/Q146491"
],
"created": "2009-10-22T02:00:00 00:00"
}]
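Since the question only needs the QCode and the name, one way to get there from records like the ones above is pd.json_normalize, which expands nested dicts such as prefLabel into dotted column names. This is only a sketch: `concept_list` is a hypothetical, trimmed copy of the two records shown, and the column name "name" is my own choice. For the en-GB file the label column would be "prefLabel.en-GB" instead.

```python
import pandas as pd

# Hypothetical, trimmed copy of the two records shown above.
concept_list = [
    {
        "uri": "http://cv.iptc.org/newscodes/mediatopic/01000000",
        "qcode": "medtop:01000000",
        "prefLabel": {"en-US": "arts, culture, entertainment and media"},
        "narrower": ["medtop:20000002", "medtop:20000038", "medtop:20000045"],
    },
    {
        "uri": "http://cv.iptc.org/newscodes/mediatopic/02000000",
        "qcode": "medtop:02000000",
        "prefLabel": {"en-US": "crime, law and justice"},
        "narrower": ["medtop:20000082", "medtop:20000106"],
    },
]

# Nested dicts become dotted columns ("prefLabel.en-US"); lists stay as lists.
flat = pd.json_normalize(concept_list)

# Keep just the two fields of interest and give the label a shorter name.
topics = flat[["qcode", "prefLabel.en-US"]].rename(
    columns={"prefLabel.en-US": "name"}
)
print(topics)
```

List-valued fields such as narrower are left as Python lists in a single cell, so they would need separate handling (for example DataFrame.explode) if you want one row per related code.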