Home > Software design >  Transform a raw json column of pandas df into more columns
Transform a raw json column of pandas df into more columns

Time:11-30

in my pandas dataframe I have a column which follows a simple pattern:

{'author_position': 'first',
  'author': {'id': 'https://openalex.org/A3003121718',
   'display_name': 'Chaolin Huang',
   'orcid': None},
  'institutions': [{'id': None,
    'display_name': 'Jin Yin-tan Hospital, Wuhan, China',
    'ror': None,
    'country_code': None,
    'type': None}],
  'raw_affiliation_string': 'Jin Yin-tan Hospital, Wuhan, China'},

and repeats the latter for each author of a certain paper. For instance, the first paper in my database has several authors and its authorships column look like this:

df['authorships'][0]

### Output:
[{'author_position': 'first',
  'author': {'id': 'https://openalex.org/A3003121718',
   'display_name': 'Chaolin Huang',
   'orcid': None},
  'institutions': [{'id': None,
    'display_name': 'Jin Yin-tan Hospital, Wuhan, China',
    'ror': None,
    'country_code': None,
    'type': None}],
  'raw_affiliation_string': 'Jin Yin-tan Hospital, Wuhan, China'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A3006261277',
   'display_name': 'Yeming Wang',
   'orcid': None},
  'institutions': [{'id': 'https://openalex.org/I2801051648',
    'display_name': 'China-Japan Friendship Hospital',
    'ror': 'https://ror.org/037cjxp13',
    'country_code': 'CN',
    'type': 'healthcare'}],
  'raw_affiliation_string': 'Department of Pulmonary and Critical Care Medicine, Center of Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, China-Japan Friendship Hospital, Beijing, China.'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A2620960243',
   'display_name': 'Xingwang Li',
   'orcid': None},
  'institutions': [{'id': 'https://openalex.org/I4210150338',
    'display_name': 'Beijing Ditan Hospital',
    'ror': 'https://ror.org/05kkkes98',
    'country_code': 'CN',
    'type': 'healthcare'}],
  'raw_affiliation_string': 'Clinical and Research Center of Infectious Diseases Beijing Ditan Hospital Capital Medical University Beijing China'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A2103212470',
   'display_name': 'Lili Ren',
   'orcid': 'https://orcid.org/0000-0002-6645-8183'},
  'institutions': [{'id': None,
    'display_name': 'NHC Key Laboratory of Systems Biology of Pathogens and Christophe Mérieux Laboratory, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.',
    'ror': None,
    'country_code': None,
    'type': None}],
  'raw_affiliation_string': 'NHC Key Laboratory of Systems Biology of Pathogens and Christophe Mérieux Laboratory, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China.'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A2582133136',
   'display_name': 'Jianping Zhao',
   'orcid': None},
  'institutions': [{'id': 'https://openalex.org/I79431787',
    'display_name': 'Tongji Medical College',
    'ror': None,
    'country_code': 'CN',
    'type': None}],
  'raw_affiliation_string': 'Tongji Hospital, Tongji medical college, Huazhong university of Science and Technology, Wuhan, China'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A2550526349',
   'display_name': 'Yi Hu',
   'orcid': None},
  'institutions': [{'id': 'https://openalex.org/I47720641',
    'display_name': 'Huazhong University of Science and Technology',
    'ror': 'https://ror.org/00p991c53',
    'country_code': 'CN',
    'type': 'education'}],
  'raw_affiliation_string': 'Department of Pulmonary and Critical Care Medicine, The Central Hospital of Wuhan, Tongji Medical College, Huazhong University of Science and Technology , Wuhan, China'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A3197971936',
   'display_name': 'Li Zhang',
   'orcid': 'https://orcid.org/0000-0002-7615-4976'},
  'institutions': [{'id': None,
    'display_name': 'Jin Yin-tan Hospital, Wuhan, China',
    'ror': None,
    'country_code': None,
    'type': None}],
  'raw_affiliation_string': 'Jin Yin-tan Hospital, Wuhan, China'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A2911488157',
   'display_name': 'Guohui Fan',
   'orcid': None},
  'institutions': [{'id': 'https://openalex.org/I2801051648',
    'display_name': 'China-Japan Friendship Hospital',
    'ror': 'https://ror.org/037cjxp13',
    'country_code': 'CN',
    'type': 'healthcare'}],
  'raw_affiliation_string': 'Department of Pulmonary and Critical Care Medicine, Center of Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, China-Japan Friendship Hospital, Beijing, China.'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A3001214061',
   'display_name': 'Jiuyang Xu',
   'orcid': 'https://orcid.org/0000-0002-1906-5918'},
  'institutions': [{'id': 'https://openalex.org/I99065089',
    'display_name': 'Tsinghua University',
    'ror': 'https://ror.org/03cve4549',
    'country_code': 'CN',
    'type': 'education'}],
  'raw_affiliation_string': 'Tsinghua University,School of Medicine,Beijing,China'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A3006530843',
   'display_name': 'Xiaoying Gu',
   'orcid': None},
  'institutions': [{'id': 'https://openalex.org/I2801051648',
    'display_name': 'China-Japan Friendship Hospital',
    'ror': 'https://ror.org/037cjxp13',
    'country_code': 'CN',
    'type': 'healthcare'}],
  'raw_affiliation_string': 'Department of Pulmonary and Critical Care Medicine, Center of Respiratory Medicine, National Clinical Research Center for Respiratory Diseases, China-Japan Friendship Hospital, Beijing, China.'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A3205428521',
   'display_name': 'Zhenshun Cheng',
   'orcid': None},
  'institutions': [{'id': 'https://openalex.org/I4210120234',
    'display_name': 'Zhongnan Hospital of Wuhan University',
    'ror': 'https://ror.org/01v5mqw79',
    'country_code': 'CN',
    'type': 'healthcare'}],
  'raw_affiliation_string': 'Department of Respiratory Medicine, Zhongnan Hospital of Wuhan University, Wuhan, China'},
 {'author_position': 'middle',
  'author': {'id': 'https://openalex.org/A2498193827',
   'display_name': 'Ting Yu',
   'orcid': None},
  'institutions': [{'id': None,
    'display_name': 'Jin Yin-tan Hospital, Wuhan, China',
    'ror': None,
    'country_code': None,
    'type': None}],
  'raw_affiliation_string': 'Jin Yin-tan Hospital, Wuhan, China'}]

Now my aim is actually to take only some of the info contained in the above, namely the name of the unique authors' and institutions' records and create two columns containing a list with the author names and the institutions name. In the above, specifically, the result should be the construction of two columns "authors" and "institutions" looking like this (for what concerns the first paper):

df['authors][0]
['Chaolin Huang','Yeming Wang','Xingwang Li','Lili Ren','Jianping Zhao','Yi Hu','Li Zhang','Guohui Fan','Jiuyang Xu','Xiaoying Gu','Zhenshun Cheng','Ting Yu']
df['institutions'][0]
['Jin Yin-tan Hospital, Wuhan, China','China-Japan Friendship Hospital','Beijing Ditan Hospital','Tsinghua University','Zhongnan Hospital of Wuhan University','Jin Yin-tan Hospital, Wuhan, China']

Please notice that doubles (e.g. 'China-Japan Friendship Hospital') are not repeated in the list.

Thank you

CodePudding user response:

you can check if the following code works!

df = pd.DataFrame()
df['authors'] = pd.json_normalize(j)['author.display_name']
df['institutions'] = pd.json_normalize(j, record_path=['institutions'])['display_name']
df
  • Related