I use Python and pandas to analyze a large data set. I have several arrays of different lengths, and I need to insert their values into specific columns. If a column has no value, it should be filled with 'Not defined'. Each input looks like a DataFrame row in which only some of the columns are present.
Expected output:
Examples of input data:
# Example 1
{'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
# Example 2
{'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}
# Example 3
{'Water Solubility': '1E 006 mg/L (at 25 °C)', 'Melting Point': '204-205 °C', 'logP': '1.1', 'pKa': '6.78'}
I have tried adding 'Not defined' to the arrays, but I couldn't find the right approach.
CodePudding user response:
I think the best way is to just create a dataframe for each dict, and then concat the dataframes.
import pandas as pd

d_1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
d_2 = {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}
# One single-row DataFrame per dict; the dict keys become the column names
df_1 = pd.DataFrame([d_1])
df_2 = pd.DataFrame([d_2])
# concat aligns the frames on column names and leaves NaN where a key was missing
final_df = pd.concat([df_1, df_2], ignore_index=True).fillna('Not defined')
You can build a function that takes a list of dicts and returns the final dataframe.
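A minimal sketch of such a function, assuming every dict holds string values as in the question. The name dicts_to_frame and the fill_value parameter are illustrative and not part of the original answer:

```python
import pandas as pd


def dicts_to_frame(records, fill_value="Not defined"):
    """Combine dicts with differing keys into one DataFrame (hypothetical helper)."""
    # One single-row frame per dict, as described in the answer
    frames = [pd.DataFrame([record]) for record in records]
    # concat takes the union of all keys as columns; missing cells become NaN
    combined = pd.concat(frames, ignore_index=True)
    # Replace the NaN cells with the placeholder text
    return combined.fillna(fill_value)


ex1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
ex2 = {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529'}
result = dicts_to_frame([ex1, ex2])
print(result.to_string())
```

The columns of the result are the union of the keys, in order of first appearance, so the function needs no column list up front.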
CodePudding user response:
This should do what you're asking:
import pandas as pd
# Example 1
ex1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
# Example 2
ex2 = {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}
# Example 3
ex3 = {'Water Solubility': '1E 006 mg/L (at 25 °C)', 'Melting Point': '204-205 °C', 'logP': '1.1', 'pKa': '6.78'}
df = pd.DataFrame({
    # '162-165' is quoted: a bare 162-165 would be evaluated as the integer -3
    'Boiling Point': ['162-165', 'Not defined'],
    'Hydrophobicity': [-0.5227, -0.427],
    'Isoelectric Point': [9.02, 12.02],
    'Melting Point': [1000.0, 'Not defined'],
    'Molecular Formula': ['C1970H3848N50O947S4', 'Not defined'],
    'Molecular Weight': [9.23, 7.13],
    'Radioactivity': ['Practically insoluble', 'Not defined'],
    'Water Solubility': [1.23, 2.87],
    'caco2 Permeability': ['63.6±55.0', 901],
    'logP': [14, 14],
    'logS': [0.618, 0.238],
    'pKa': ['Not defined', 'Not defined']
})
# Append the three example dicts as new rows; keys missing from a dict become NaN
df = pd.concat([df, pd.DataFrame([ex1, ex2, ex3])], ignore_index=True)
# Only the appended rows contain NaN, so filling the whole frame touches just those rows
df = df.fillna('Not defined')
print(df.to_string())
Output:
   Boiling Point  Hydrophobicity  Isoelectric Point      Melting Point      Molecular Formula  Molecular Weight          Radioactivity        Water Solubility  caco2 Permeability         logP         logS          pKa
0        162-165         -0.5227               9.02             1000.0    C1970H3848N50O947S4              9.23  Practically insoluble                    1.23           63.6±55.0           14        0.618  Not defined
1    Not defined          -0.427              12.02        Not defined            Not defined              7.13            Not defined                    2.87                 901           14        0.238  Not defined
2    Not defined     Not defined        Not defined         135-138 °C            Not defined       Not defined            Not defined              Insoluble          Not defined         4.68  Not defined  Not defined
3    Not defined          -0.529               7.89  71 °C (whole mAb)  C2224H3475N621O698S36           51234.9            Not defined             Not defined         Not defined  Not defined  Not defined  Not defined
4    Not defined     Not defined        Not defined         204-205 °C            Not defined       Not defined            Not defined  1E 006 mg/L (at 25 °C)         Not defined          1.1  Not defined         6.78
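If the full set of columns is known in advance, a variant can pin the columns without starting from an existing frame. This is only a sketch: the helper name rows_with_schema is hypothetical, and ALL_COLUMNS is an assumed schema copied from the column names used above.

```python
import pandas as pd

# Assumed schema: the column names of the frame shown above
ALL_COLUMNS = [
    'Boiling Point', 'Hydrophobicity', 'Isoelectric Point', 'Melting Point',
    'Molecular Formula', 'Molecular Weight', 'Radioactivity', 'Water Solubility',
    'caco2 Permeability', 'logP', 'logS', 'pKa',
]


def rows_with_schema(records, columns=ALL_COLUMNS, fill_value='Not defined'):
    """Build rows from dicts and force a fixed column set (hypothetical helper)."""
    # Keys absent from a dict become NaN; fill those first
    df = pd.DataFrame(records).fillna(fill_value)
    # reindex adds schema columns that no dict supplied (filled with fill_value)
    # and drops any key that is not part of the schema
    return df.reindex(columns=columns, fill_value=fill_value)


ex1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
ex3 = {'Water Solubility': '1E 006 mg/L (at 25 °C)', 'Melting Point': '204-205 °C', 'logP': '1.1', 'pKa': '6.78'}
fixed = rows_with_schema([ex1, ex3])
print(fixed.to_string())
```

Note that reindex silently discards keys outside the schema, so a misspelled key would vanish rather than raise an error.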