Home > Software engineering >  Creating a dataframe out of a column where each cell contains an array of nested python dictionaries
Creating a dataframe out of a column where each cell contains an array of nested python dictionaries

Time:01-08

I harvested Data using the Unpaywall API of multiple journal articles, and I got the authors' column to contain an array of nested python dictionaries where each dictionary contains some information about the author, what I am trying to do is to get each of these arrays as a row in a dataframe and have each author's information labeled by its indexed. To clarify here is an example of the data I have

Authors
[{'ORCID': 'http://orcid.org/0000-0001-5586-6889', 'given': 'Ayman', 'family': 'Taha', 'sequence': 'first', 'authenticated-orcid': False}, {'ORCID': 'http://orcid.org/0000-0002-1533-8333', 'given': 'Ali S.', 'family': 'Hadi', 'sequence': 'additional', 'authenticated-orcid': False}, {'given': 'Bernard', 'family': 'Cosgrave', 'sequence': 'additional'}, {'given': 'Susan', 'family': 'McKeever', 'sequence': 'additional'}]
[{'given': 'Piteesha', 'family': 'Ramlagan', 'sequence': 'first', 'affiliation': [{'name': 'Biopharmaceutical Unit, Centre for Biomedical and Biomaterials Research, MSIRI Building, University of Mauritius, Réduit, Mauritius'}, {'name': 'Department of Health Sciences, Faculty of Medicine and Health Sciences, University of Mauritius, Réduit, Mauritius'}]}, {'given': 'Marwa Yousry', 'family': 'Issa', 'sequence': 'additional', 'affiliation': [{'name': 'Pharmacognosy Department, College of Pharmacy, Cairo University, Cairo, Egypt'}]}, {'given': 'Philippe', 'family': 'Rondeau', 'sequence': 'additional', 'affiliation': [{'name': 'Université de La Réunion, INSERM, UMR 1188 Diabète athérothrombose Thérapies Réunion Océan Indien (DéTROI), Saint-Denis de La Réunion, France'}]}, {'given': 'Emmanuel', 'family': 'Bourdon', 'sequence': 'additional', 'affiliation': [{'name': 'Université de La Réunion, INSERM, UMR 1188 Diabète athérothrombose Thérapies Réunion Océan Indien (DéTROI), Saint-Denis de La Réunion, France'}]}, {'given': 'Theeshan', 'family': 'Bahorun', 'sequence': 'additional', 'affiliation': [{'name': 'Biopharmaceutical Unit, Centre for Biomedical and Biomaterials Research, MSIRI Building, University of Mauritius, Réduit, Mauritius'}, {'name': 'Department of Biosciences and Ocean Studies, Faculty of Science, University of Mauritius, Réduit, Mauritius'}, {'name': 'Mauritius Research Innovation Council, Ebène, Mauritius'}]}, {'ORCID': 'http://orcid.org/0000-0001-5139-1863', 'given': 'Mohamed A.', 'family': 'Farag', 'sequence': 'additional', 'affiliation': [{'name': 'Pharmacognosy Department, College of Pharmacy, Cairo University, Cairo, Egypt'}, {'name': 'Chemistry Department, School of Sciences & Engineering, The American University in Cairo, New Cairo, Egypt'}], 'authenticated-orcid': False}, {'given': 'Vidushi S', 'family': 'Neergheen', 'sequence': 'additional', 'affiliation': [{'name': 'Biopharmaceutical Unit, Centre for Biomedical and Biomaterials Research, MSIRI Building, University of Mauritius, Réduit, Mauritius'}]}]

What I want to get is something like:

| Given_1  | Family_1 |.....etc for the rest of information for every cell in the above column.
| -------- | -------- |
|  Ayman   | Taha     |
| Piteesha | Ramlagan |

How can I do that with python?

What I tried is to get every cell's content as a dataframe then every row as another dataframe then join them with different suffixes, but apparently I cannot use join on multiple dataframes.

CodePudding user response:

You can try:

# apply ast.literal_eval if necessary
from ast import literal_eval
df["Authors"] = df["Authors"].apply(literal_eval)


df["index"] = range(1, len(df)   1)
df = df.explode("Authors")

df["Given"] = df["Authors"].str["given"]
df["Family"] = df["Authors"].str["family"]

df.pop("Authors")

df["counter2"] = df.groupby("index").cumcount()   1
df = df.pivot(index="index", columns=["counter2"])
df.columns = [f"{a}_{b}" for a, b in df.columns]
print(df[sorted(df, key=lambda c: int(c.split("_")[-1]))])

Prints:

        Given_1  Family_1       Given_2 Family_2   Given_3  Family_3   Given_4  Family_4   Given_5 Family_5     Given_6 Family_6    Given_7   Family_7
index                                                                                                                                                 
1         Ayman      Taha        Ali S.     Hadi   Bernard  Cosgrave     Susan  McKeever       NaN      NaN         NaN      NaN        NaN        NaN
2      Piteesha  Ramlagan  Marwa Yousry     Issa  Philippe   Rondeau  Emmanuel   Bourdon  Theeshan  Bahorun  Mohamed A.    Farag  Vidushi S  Neergheen
  • Related