To summarize as concisely as I can, I have data file containing a list of chemical compounds along with their ID numbers ("CID" numbers). My goal is to use pubchempy's pubchempy.get_properties function along with pandas' df.map function to essentially obtain the properties of each compound (there is one compound per row) using the "CID" number as an identifier. The parameters of pubchempy.get_properties is an identifier ("CID" number in this case) along with the property of the chemical that you want to obtain from the pubchem website (Molecular weight in this case).
This is the code that I have written currently:
import pandas as pd
import pubchempy
import numpy as np
df = pd.read_csv("Data.tsv.txt", sep="\t")
from pubchempy import get_properties
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))
df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))
df = df.drop(df[df.CID=='nan'].index)
df = df.drop( df.index.to_list()[5:] ,axis = 0 )
df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA)
df = df.rename(columns={'CID.': 'MolecularWeight'})
print(df)
This is the output that I was initially getting for that column (only including a few rows, in reality, dataset is very big):
MolecularWeight
[{'CID': 5339, 'MolecularWeight': '398.4'}]
[{'CID': 3889, 'MolecularWeight': '520.5'}]
[{'CID': 2788, 'MolecularWeight': '305.50'}]
[{'CID': 1422517, 'MolecularWeight': '440.5'}]
.
.
.
Now, the code was somewhat working in that it is providing me with the molecular weight of the compound (398.4) but I didn't want all that extra bit of writing nor did I want the quote marks around the molecular weight number (both of these get in the way of the next bit of code that I plan to write).
So I then added this bit of code:
df['MolecularWeight'] = df.MolecularWeight[0][0].get('MolecularWeight')
This is the output that I am now getting:
MolecularWeight
398.4
398.4
398.4
398.4
.
.
.
What I want to do is pretty much exactly the same it's just that instead of getting the molecular weight of the first row in the MolecularWeight column and copying it onto all the other rows, I want to have the molecular weight value of each individual row in that column as the output.
What I was hoping to get is something like this:
MolecularWeight
398.4
520.5
305.50
440.5
.
.
.
Does anyone know how I can solve this issue? I've spent many hours trying to figure it out myself with no luck. I'd appreciate any help!
Few lines of text file:
NO. compound_name IUPAC_name SMILES CID Inchi threshold reference group comments
1 sulphasalazine 2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O 5339 InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25) R2|R2|R25|R46| A
2 moxalactam 7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889 InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32) R25| A
3 clioquinol 5-chloro-7-iodoquinolin-8-ol Oc1c(I)cc(Cl)c2cccnc12 2788 InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H R18|R26|R27| A
CodePudding user response:
If you cast the column to float, that should help you: df['MolecularWeight'] = df['MolecularWeight'].astype(float)
.
CodePudding user response:
Try:
df.MolecularWeight = pd.to_numeric(df.MolecularWeight.str[0].str.get('MolecularWeight'))
Input/Output:
MolecularWeight
0 [{'CID': 5339, 'MolecularWeight': '398.4'}]
1 [{'CID': 3889, 'MolecularWeight': '520.5'}]
2 [{'CID': 2788, 'MolecularWeight': '305.50'}]
3 [{'CID': 1422517, 'MolecularWeight': '440.5'}]
MolecularWeight
0 398.4
1 520.5
2 305.5
3 440.5