Insert values into specific columns of a pandas DataFrame

I use Python and pandas to analyze a big data set. I have several arrays of different lengths, and I need to insert their values into specific columns. If a column has no value, it should contain 'Not defined'. Each input looks like a DataFrame row whose values sit in different positions.
Examples of input data:

# Example 1
{'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}

# Example 2
{'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}

# Example 3
{'Water Solubility': '1E 006 mg/L (at 25 °C)', 'Melting Point': '204-205 °C', 'logP': '1.1', 'pKa': '6.78'}

I have tried adding 'Not defined' to the arrays, but I couldn't find the right approach.

CodePudding user response：

I think the best way is to just create a dataframe for each dict, and then concat the dataframes.

import pandas as pd

d_1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}

d_2 = {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}

# One single-row DataFrame per dict; the dict keys become the column names
df_1 = pd.DataFrame([d_1])

df_2 = pd.DataFrame([d_2])

# concat aligns on column names; cells with no value become NaN and are then replaced
final_df = pd.concat([df_1, df_2], ignore_index=True).fillna('Not defined')

You can build a function that takes a list of dicts and returns the final DataFrame.
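
A minimal sketch of such a function, assuming every record is a flat dict mapping a property name to a value. The name `records_to_frame` and its `placeholder` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd


def records_to_frame(records, placeholder="Not defined"):
    """Build one DataFrame from a list of dicts whose keys may differ."""
    # The DataFrame constructor aligns the dicts on their keys;
    # a key missing from a record becomes NaN in that row.
    frame = pd.DataFrame(records)
    # Cast to object first so the string placeholder fits in any column.
    return frame.astype(object).fillna(placeholder)


d_1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
d_2 = {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529',
       'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9',
       'Molecular Formula': 'C2224H3475N621O698S36'}

final_df = records_to_frame([d_1, d_2])
print(final_df.to_string())
```

Passing the whole list to the constructor avoids building one DataFrame per dict, which may matter when there are many records.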

CodePudding user response：

This should do what you're asking:

import pandas as pd

# Example 1
ex1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}

# Example 2
ex2 = {'Melting Point': '71 °C (whole mAb)', 'Hydrophobicity': '-0.529', 'Isoelectric Point': '7.89', 'Molecular Weight': '51234.9', 'Molecular Formula': 'C2224H3475N621O698S36'}

# Example 3
ex3 = {'Water Solubility': '1E 006 mg/L (at 25 °C)', 'Melting Point': '204-205 °C', 'logP': '1.1', 'pKa': '6.78'}


df = pd.DataFrame({
    # Quoted on purpose: an unquoted 162-165 is a subtraction and evaluates to -3
    'Boiling Point':['162-165', 'Not defined'],
    'Hydrophobicity':[-0.5227, -0.427],
    'Isoelectric Point':[9.02, 12.02],
    'Melting Point':[1000.0, 'Not defined'],
    'Molecular Formula':['C1970H3848N50O947S4', 'Not defined'],
    'Molecular Weight':[9.23, 7.13],
    'Radioactivity':['Practically insoluble', 'Not defined'],
    'Water Solubility':[1.23, 2.87],
    'caco2 Permeability':['63.6±55.0', 901],
    'logP':[14, 14],
    'logS':[0.618, 0.238],
    'pKa':['Not defined', 'Not defined']
})

# Append the three dicts as new rows; columns they lack are filled with NaN
df = pd.concat([df, pd.DataFrame([ex1, ex2, ex3])], ignore_index=True)
# Cast to object so the string placeholder is also valid in numeric columns
df = df.astype(object).fillna('Not defined')
print(df.to_string())

Output:

  Boiling Point Hydrophobicity Isoelectric Point      Melting Point      Molecular Formula Molecular Weight          Radioactivity        Water Solubility caco2 Permeability         logP         logS          pKa
0       162-165        -0.5227              9.02             1000.0    C1970H3848N50O947S4             9.23  Practically insoluble                    1.23          63.6±55.0           14        0.618  Not defined
1   Not defined         -0.427             12.02        Not defined            Not defined             7.13            Not defined                    2.87                901           14        0.238  Not defined
2   Not defined    Not defined       Not defined         135-138 °C            Not defined      Not defined            Not defined              Insoluble         Not defined         4.68  Not defined  Not defined
3   Not defined         -0.529              7.89  71 °C (whole mAb)  C2224H3475N621O698S36          51234.9            Not defined             Not defined        Not defined  Not defined  Not defined  Not defined
4   Not defined    Not defined       Not defined         204-205 °C            Not defined      Not defined            Not defined  1E 006 mg/L (at 25 °C)        Not defined          1.1  Not defined         6.78
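
If the set of columns is known in advance, another option is to fill each record against that list before it reaches pandas, so no NaN is ever created. The sketch below is only an illustration: the `COLUMNS` list and the `to_row` helper are hypothetical names and do not come from the answers above.

```python
import pandas as pd

# Hypothetical fixed column list; in practice it would hold every expected property.
COLUMNS = ['Water Solubility', 'Melting Point', 'logP', 'pKa', 'Hydrophobicity']


def to_row(record, columns=COLUMNS, placeholder='Not defined'):
    """Return a dict with exactly the given columns, using the placeholder for gaps."""
    return {col: record.get(col, placeholder) for col in columns}


ex1 = {'Water Solubility': 'Insoluble ', 'Melting Point': '135-138 °C', 'logP': '4.68'}
ex3 = {'Water Solubility': '1E 006 mg/L (at 25 °C)', 'Melting Point': '204-205 °C',
       'logP': '1.1', 'pKa': '6.78'}

df = pd.DataFrame([to_row(ex1), to_row(ex3)], columns=COLUMNS)
print(df.to_string())
```

Note that with this approach any key that is not in `COLUMNS` is silently dropped, so the list has to be complete.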