I harvested Data using the Unpaywall API of multiple journal articles, and I got the authors' column to contain an array of nested python dictionaries where each dictionary contains some information about the author, what I am trying to do is to get each of these arrays as a row in a dataframe and have each author's information labeled by its indexed. To clarify here is an example of the data I have
Authors |
---|
[{'ORCID': 'http://orcid.org/0000-0001-5586-6889', 'given': 'Ayman', 'family': 'Taha', 'sequence': 'first', 'authenticated-orcid': False}, {'ORCID': 'http://orcid.org/0000-0002-1533-8333', 'given': 'Ali S.', 'family': 'Hadi', 'sequence': 'additional', 'authenticated-orcid': False}, {'given': 'Bernard', 'family': 'Cosgrave', 'sequence': 'additional'}, {'given': 'Susan', 'family': 'McKeever', 'sequence': 'additional'}] |
[{'given': 'Piteesha', 'family': 'Ramlagan', 'sequence': 'first', 'affiliation': [{'name': 'Biopharmaceutical Unit, Centre for Biomedical and Biomaterials Research, MSIRI Building, University of Mauritius, Réduit, Mauritius'}, {'name': 'Department of Health Sciences, Faculty of Medicine and Health Sciences, University of Mauritius, Réduit, Mauritius'}]}, {'given': 'Marwa Yousry', 'family': 'Issa', 'sequence': 'additional', 'affiliation': [{'name': 'Pharmacognosy Department, College of Pharmacy, Cairo University, Cairo, Egypt'}]}, {'given': 'Philippe', 'family': 'Rondeau', 'sequence': 'additional', 'affiliation': [{'name': 'Université de La Réunion, INSERM, UMR 1188 Diabète athérothrombose Thérapies Réunion Océan Indien (DéTROI), Saint-Denis de La Réunion, France'}]}, {'given': 'Emmanuel', 'family': 'Bourdon', 'sequence': 'additional', 'affiliation': [{'name': 'Université de La Réunion, INSERM, UMR 1188 Diabète athérothrombose Thérapies Réunion Océan Indien (DéTROI), Saint-Denis de La Réunion, France'}]}, {'given': 'Theeshan', 'family': 'Bahorun', 'sequence': 'additional', 'affiliation': [{'name': 'Biopharmaceutical Unit, Centre for Biomedical and Biomaterials Research, MSIRI Building, University of Mauritius, Réduit, Mauritius'}, {'name': 'Department of Biosciences and Ocean Studies, Faculty of Science, University of Mauritius, Réduit, Mauritius'}, {'name': 'Mauritius Research Innovation Council, Ebène, Mauritius'}]}, {'ORCID': 'http://orcid.org/0000-0001-5139-1863', 'given': 'Mohamed A.', 'family': 'Farag', 'sequence': 'additional', 'affiliation': [{'name': 'Pharmacognosy Department, College of Pharmacy, Cairo University, Cairo, Egypt'}, {'name': 'Chemistry Department, School of Sciences & Engineering, The American University in Cairo, New Cairo, Egypt'}], 'authenticated-orcid': False}, {'given': 'Vidushi S', 'family': 'Neergheen', 'sequence': 'additional', 'affiliation': [{'name': 'Biopharmaceutical Unit, Centre for Biomedical and Biomaterials Research, MSIRI Building, University of Mauritius, Réduit, Mauritius'}]}] |
What I want to get is something like:
| Given_1 | Family_1 |.....etc for the rest of information for every cell in the above column.
| -------- | -------- |
| Ayman | Taha |
| Piteesha | Ramlagan |
How can I do that with python?
What I tried is to get every cell's content as a dataframe then every row as another dataframe then join them with different suffixes, but apparently I cannot use join on multiple dataframes.
CodePudding user response:
You can try:
# apply ast.literal_eval if necessary
from ast import literal_eval
df["Authors"] = df["Authors"].apply(literal_eval)
df["index"] = range(1, len(df) 1)
df = df.explode("Authors")
df["Given"] = df["Authors"].str["given"]
df["Family"] = df["Authors"].str["family"]
df.pop("Authors")
df["counter2"] = df.groupby("index").cumcount() 1
df = df.pivot(index="index", columns=["counter2"])
df.columns = [f"{a}_{b}" for a, b in df.columns]
print(df[sorted(df, key=lambda c: int(c.split("_")[-1]))])
Prints:
Given_1 Family_1 Given_2 Family_2 Given_3 Family_3 Given_4 Family_4 Given_5 Family_5 Given_6 Family_6 Given_7 Family_7
index
1 Ayman Taha Ali S. Hadi Bernard Cosgrave Susan McKeever NaN NaN NaN NaN NaN NaN
2 Piteesha Ramlagan Marwa Yousry Issa Philippe Rondeau Emmanuel Bourdon Theeshan Bahorun Mohamed A. Farag Vidushi S Neergheen