I am looking to split a dataframe column that contains a string of a dictionary into separate columns. I've seen a few methods, but I want to avoid splitting the string since there are some inconsistencies. For instance, "Melting Point" sometimes takes the place of "Boiling Point", but I do not want melting point and boiling point to be in the same column.
Here is the column I am trying to split.
#example below
data = [
'''[{'name': 'Boiling Point', 'property': '115.3 °C', 'sourceNumber': 1}]''',
'''[{'name': 'Boiling Point', 'property': '91 °C @ Press: 20 Torr', 'sourceNumber': 1}]''',
'''[{'name': 'Boiling Point', 'property': '58 °C @ Press: 12 Torr', 'sourceNumber': 1}, {'name': 'Density', 'property': '0.8753 g/cm<sup>3</sup> @ Temp: 20 °C', 'sourceNumber': 1}]''']
df = pd.DataFrame(data, columns=['experimental_properties'])
I want it to look like this first row:
I tried a method from here to no avail: How to convert JSON data inside a pandas column into new columns
pd.io.json.json_normalize(df.experimental_properties.apply(json.loads))
Help is much appreciated!
CodePudding user response:
I hope I've understood your question well. Try:
from ast import literal_eval
df["experimental_properties"] = df["experimental_properties"].apply(
lambda x: {d["name"]: d["property"] for d in literal_eval(x)}
)
df = pd.concat([df, df.pop("experimental_properties").apply(pd.Series)], axis=1)
print(df)
Prints:
Boiling Point Density
0 115.3 °C NaN
1 91 °C @ Press: 20 Torr NaN
2 58 °C @ Press: 12 Torr 0.8753 g/cm<sup>3</sup> @ Temp: 20 °C
CodePudding user response:
Is the expected output really what you are looking for? Another way to visualise the data would be to have "name", "property", and "sourceNumber" as column names.
import json
import pandas as pd
data = [
'''[{'name': 'Boiling Point', 'property': '115.3 °C', 'sourceNumber': 1}]''',
'''[{'name': 'Boiling Point', 'property': '91 °C @ Press: 20 Torr', 'sourceNumber': 1}]''',
'''[{'name': 'Boiling Point', 'property': '58 °C @ Press: 12 Torr', 'sourceNumber': 1}, {'name': 'Density', 'property': '0.8753 g/cm<sup>3</sup> @ Temp: 20 °C', 'sourceNumber': 1}]''']
#Initialise a naiveList
naiveList = []
#String to List
for i in data:
tempStringOfData = i
tempStringOfData = tempStringOfData.replace("\'", "\"")
tempJsonData = json.loads(tempStringOfData)
naiveList.append(tempJsonData)
#Initialise a List for Dictionaries
newListOfDictionaries = []
for i in naiveList:
for j in i:
newListOfDictionaries.append(j)
df = pd.DataFrame(newListOfDictionaries)
print(df)
Which gives you
name property sourceNumber
0 Boiling Point 115.3 °C 1
1 Boiling Point 91 °C @ Press: 20 Torr 1
2 Boiling Point 58 °C @ Press: 12 Torr 1
3 Density 0.8753 g/cm<sup>3</sup> @ Temp: 20 °C 1