Regex for variable length

Time:06-23

I am looking for a regex or another command/workaround to extract all pKa values from a very large list covering hundreds of chemicals. So far, I have managed to extract the desired pKa values from a subset of my list.

However, I wonder whether it is also possible to extract the whole lines that contain the pKa values. Since these lines all have a roughly comparable length, I figured a regex could extract them, but I don't know how to combine a length constraint with matching the specific lines that contain the pKa values. I ask because my regex misses pKa values that start with a 0. Chemicals like this are uncommon, but they do exist. Extracting the whole line would also catch the few entries that give a temperature, which my regex does not capture.

Below is a (hopefully) minimal working example with an extract of my list.

library(stringr)
list_pkas <- structure(list(Chemical = c("MCPA", "Aspirin"), pka = c("3.2.13Dissociation Constants\r\npKa= 3.13\r\nCessna AJ, Grover R; J Agric Food Chem 26: 289-92(1978)\r\nHazardous Substances Data Bank (HSDB)", 
                                                                     "3.2.14Dissociation Constants\r\nAcidic pKa\r\n3.47\r\nTested as SID 103164874 in AID 781325: https://pubchem.ncbi.nlm.nih.gov/bioassay/781325#sid=103164874\r\nComparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res. 2014; 31(4):1082-95. DOI:10.1007/s11095-013-1232-z. PMID:24249037\r\nChEMBL\r\nAcidic pKa\r\n3.5\r\nTested as SID 103164874 in AID 781326: https://pubchem.ncbi.nlm.nih.gov/bioassay/781326#sid=103164874\r\nComparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res. 2014; 31(4):1082-95. DOI:10.1007/s11095-013-1232-z. PMID:24249037\r\nChEMBL; DrugBank\r\npKa = 3.49 at 25 °C\r\nO'Neil, M.J. (ed.). The Merck Index - An Encyclopedia of Chemicals, Drugs, and Biologicals. Whitehouse Station, NJ: Merck and Co., Inc., 2006., p. 140\r\nHazardous Substances Data Bank (HSDB)"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))

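# Take the Aspirin record and drop the leading section number ("3.2.14")
string <- list_pkas$pka[2]
string_sub <- str_sub(string, 7)
# Only matches values whose integer part is a single digit from 1 to 9,
# so a value such as "0.7" is missed
pkas <- str_extract_all(string_sub, "([1-9]\\.[0-9]{1,2})")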

The expected output for MCPA is:

3.13

or

pKa=3.13

For Aspirin:

3.47
3.5
pKa = 3.49 at 25 °C

Any help is much appreciated!

CodePudding user response:

You can use the lookbehind assertion (?<=foo):

str_extract_all(list_pkas$pka, "(?<=pKa\\D{0,5})\\d.*")

# [[1]]
# [1] "3.13"
# 
# [[2]]
# [1] "3.47"          "3.5"           "3.49 at 25 °C"
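
If the whole lines are wanted rather than only the values, a line-based variant may be easier to reason about than a length constraint. The following is a minimal sketch in base R, assuming every record uses "\r\n" as its line separator (as in the example data). `extract_pka_lines` is a hypothetical helper name, not a stringr function. It keeps lines where "pKa" is followed closely by a digit, plus bare numbers that sit directly under a header line ending in "pKa", so values starting with 0 and lines with a temperature are both kept.

```r
# Hypothetical helper (not part of stringr): keep the lines of one record
# that carry a pKa value. Assumes lines are separated by "\r\n".
extract_pka_lines <- function(record) {
  lines <- strsplit(record, "\r\n", fixed = TRUE)[[1]]
  # Lines where "pKa" is followed by a digit within at most five characters
  has_value <- grepl("pKa\\D{0,5}\\d", lines)
  # Bare numbers sitting directly under a header line that ends in "pKa"
  is_number <- grepl("^\\d+(\\.\\d+)?$", lines)
  after_header <- c(FALSE, grepl("pKa\\s*$", head(lines, -1)))
  lines[has_value | (is_number & after_header)]
}

# Shortened copy of the MCPA record from the question
mcpa <- "3.2.13Dissociation Constants\r\npKa= 3.13\r\nCessna AJ, Grover R; J Agric Food Chem 26: 289-92(1978)"
extract_pka_lines(mcpa)
```

To run it over the whole table, something like `lapply(list_pkas$pka, extract_pka_lines)` should give one character vector per chemical.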

CodePudding user response:

I think that this expression might do what you need:

"pKa\\D{0,5}((?:\\s*\\d+\\.*\\d*)(?:\\s*at\\s*\\d+\\s*.*?\\w)*)"
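
One way this expression could be applied is sketched below in base R, assuming the quantifiers after `\\d` are `+` and using `perl = TRUE` for the PCRE engine. `extract_pka_values` is a hypothetical helper name. Because `gregexpr` returns the full match rather than the capture group, the sketch strips the leading "pKa" and separator afterwards to leave only the captured part.

```r
# Assumption: "+" quantifiers after \\d; perl = TRUE selects the PCRE engine
pattern <- "pKa\\D{0,5}((?:\\s*\\d+\\.*\\d*)(?:\\s*at\\s*\\d+\\s*.*?\\w)*)"

# Hypothetical helper: return the pKa values (with temperature, if given)
# found in a single record
extract_pka_values <- function(record) {
  full <- regmatches(record, gregexpr(pattern, record, perl = TRUE))[[1]]
  # Drop the "pKa" prefix and separator so only the captured part remains
  sub("^pKa\\D{0,5}", "", full, perl = TRUE)
}

# Shortened copy of the MCPA record from the question
mcpa <- "3.2.13Dissociation Constants\r\npKa= 3.13\r\nCessna AJ, Grover R; J Agric Food Chem 26: 289-92(1978)"
extract_pka_values(mcpa)
```

With stringr, `str_match_all(record, pattern)` should return the same capture group in the second column of its result matrix.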
