Home > Software engineering >  How to use pandas' df.get function for a dataframe column so that each row in the column mainta
How to use pandas' df.get function for a dataframe column so that each row in the column mainta

Time:05-30

To summarize as concisely as I can, I have data file containing a list of chemical compounds along with their ID numbers ("CID" numbers). My goal is to use pubchempy's pubchempy.get_properties function along with pandas' df.map function to essentially obtain the properties of each compound (there is one compound per row) using the "CID" number as an identifier. The parameters of pubchempy.get_properties is an identifier ("CID" number in this case) along with the property of the chemical that you want to obtain from the pubchem website (Molecular weight in this case).


This is the code that I have written currently:

import pandas as pd

import pubchempy

import numpy as np

df = pd.read_csv("Data.tsv.txt", sep="\t")

from pubchempy import get_properties

df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))

df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))

df = df.drop(df[df.CID=='nan'].index)

df = df.drop( df.index.to_list()[5:] ,axis = 0 )

df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA)

df = df.rename(columns={'CID.': 'MolecularWeight'})

print(df)

This is the output that I was initially getting for that column (only including a few rows, in reality, dataset is very big):

MolecularWeight

[{'CID': 5339, 'MolecularWeight': '398.4'}]

[{'CID': 3889, 'MolecularWeight': '520.5'}]

[{'CID': 2788, 'MolecularWeight': '305.50'}]

[{'CID': 1422517, 'MolecularWeight': '440.5'}]

.

.

.


Now, the code was somewhat working in that it is providing me with the molecular weight of the compound (398.4) but I didn't want all that extra bit of writing nor did I want the quote marks around the molecular weight number (both of these get in the way of the next bit of code that I plan to write).

So I then added this bit of code:

df['MolecularWeight'] = df.MolecularWeight[0][0].get('MolecularWeight')

This is the output that I am now getting:

MolecularWeight

398.4

398.4

398.4

398.4

.

.

.


What I want to do is pretty much exactly the same it's just that instead of getting the molecular weight of the first row in the MolecularWeight column and copying it onto all the other rows, I want to have the molecular weight value of each individual row in that column as the output.

What I was hoping to get is something like this:

MolecularWeight

398.4

520.5

305.50

440.5

.

.

.

Does anyone know how I can solve this issue? I've spent many hours trying to figure it out myself with no luck. I'd appreciate any help!


Few lines of text file:

NO. compound_name   IUPAC_name  SMILES  CID     Inchi   threshold   reference   group   comments
1   sulphasalazine  2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O    5339    InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25)      R2|R2|R25|R46|  A   
2   moxalactam  7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid   COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889    InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32)      R25|    A   
3   clioquinol  5-chloro-7-iodoquinolin-8-ol    Oc1c(I)cc(Cl)c2cccnc12  2788    InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H      R18|R26|R27|    A   

CodePudding user response:

If you cast the column to float, that should help you: df['MolecularWeight'] = df['MolecularWeight'].astype(float).

CodePudding user response:

Try:

df.MolecularWeight = pd.to_numeric(df.MolecularWeight.str[0].str.get('MolecularWeight'))

Input/Output:

                                  MolecularWeight
0     [{'CID': 5339, 'MolecularWeight': '398.4'}]
1     [{'CID': 3889, 'MolecularWeight': '520.5'}]
2    [{'CID': 2788, 'MolecularWeight': '305.50'}]
3  [{'CID': 1422517, 'MolecularWeight': '440.5'}]

   MolecularWeight
0            398.4
1            520.5
2            305.5
3            440.5
  • Related