I have downloaded a CSV file from Pubchem containing over 5000 records. One of the columns contains a bunch of computed synonyms where CAS Number is the records I wish to extract. Unfortunately, the CAS number isn't necessarily in the same position in this list, making splitting by delimiter more difficult. Below is the source data example and the desired output I am trying to achieve.
An older answer to a post a while back used a Regex function to extract strings of Numbers with a given length.
fnRegexExtr
let fx=(text,regex)=>
Web.Page(
"<script>
var x='"&text&"';
var y=new RegExp('"®ex&"','g');
var b=x.match(y);
document.write(b);
</script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}
in
fx
Unsure if this is possible here and unfamiliar with Regex but I'm wondering if it is possible to modify this function to extract CAS numbers. The difficulty is that CAS Numbers can be in various formats CAS Numbers are up to 10 digits long using the format xxxxxxx-yy-z.
If anyone has any alternative solutions to extracting CAS numbers with this somewhat complex data feel free to post.
Data:
cid and cmpdname can be anything.
1-Aminopropan-2-ol|1-AMINO-2-PROPANOL|78-96-6|Isopropanolamine|Monoisopropanolamine
1-chloro-2,4-dinitrobenzene|2,4-Dinitrochlorobenzene|97-00-7|Dinitrochlorobenzene|DNCB|Chlorodinitrobenzene|CDNB
1,2-dichloroethane|Ethylene dichloride|107-06-2|Ethylene chloride|Ethane, 1,2-dichloro-|Glycol dichloride|Dutch liquid|Dutch oil|Ethane dichloride|Aethylenchloride
1,2,4-trichlorobenzene|120-82-1|Benzene, 1,2,4-trichloro-|unsym-Trichlorobenzene|Hostetex L-pec|Trojchlorobenzene
CHLOROACETALDEHYDE|2-chloroacetaldehyde|107-20-0|Chloroethanal|2-Chloroethanal|Acetaldehyde, chloro-|Chloroaldehyde|Monochloroacetaldehyde|2-Chloro-1-ethanal
CodePudding user response:
In PQ, this will pull out the contents of any item that does not contain a letter in cmpdsynonym, which I think is basically what you are looking for
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"cmpdsynonym", type text}}),
#"Added Custom1" = Table.AddColumn(#"Changed Type", "Custom.1", each
Text.Split([cmpdsynonym],"|"){
List.PositionOf(
List.Transform(
Text.Split([cmpdsynonym],"|")
, each List.ContainsAny({_}, {"a".."z","A".."Z"}))
,true)
-1
}
)
in #"Added Custom1"
CodePudding user response:
Here's one way of doing it in PQ, using fnRegexExtr
to return the CAS; and a simple Text.Split
to return the chemical compound name:
let
//Read in data and set data type as text
Source = Excel.CurrentWorkbook(){[Name="Compounds"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
//Transform to desired output
Result = Table.FromColumns(
{List.Transform(#"Changed Type"[Column1], each Text.Split(_,"|")){0}}
& {List.Transform(#"Changed Type"[Column1],each fnRegexExtr(_, "\\b\\d{1,7}-\\d{2}-\\d"))},
type table[Compound=text, CAS=text]
)
in
Result