Home > Blockchain >  Extract CAS Number from Downloaded Data
Extract CAS Number from Downloaded Data

Time:05-04

I have downloaded a CSV file from Pubchem containing over 5000 records. One of the columns contains a bunch of computed synonyms where CAS Number is the records I wish to extract. Unfortunately, the CAS number isn't necessarily in the same position in this list, making splitting by delimiter more difficult. Below is the source data example and the desired output I am trying to achieve.

enter image description here

An older answer to a post a while back used a Regex function to extract strings of Numbers with a given length.

fnRegexExtr

let   fx=(text,regex)=>
    Web.Page(
        "<script>
            var x='"&text&"';
            var y=new RegExp('"&regex&"','g');
            var b=x.match(y);
            document.write(b);
        </script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}

in
fx

Unsure if this is possible here and unfamiliar with Regex but I'm wondering if it is possible to modify this function to extract CAS numbers. The difficulty is that CAS Numbers can be in various formats CAS Numbers are up to 10 digits long using the format xxxxxxx-yy-z.

If anyone has any alternative solutions to extracting CAS numbers with this somewhat complex data feel free to post.

Data:

cid and cmpdname can be anything.

1-Aminopropan-2-ol|1-AMINO-2-PROPANOL|78-96-6|Isopropanolamine|Monoisopropanolamine
1-chloro-2,4-dinitrobenzene|2,4-Dinitrochlorobenzene|97-00-7|Dinitrochlorobenzene|DNCB|Chlorodinitrobenzene|CDNB
1,2-dichloroethane|Ethylene dichloride|107-06-2|Ethylene chloride|Ethane, 1,2-dichloro-|Glycol dichloride|Dutch liquid|Dutch oil|Ethane dichloride|Aethylenchloride
1,2,4-trichlorobenzene|120-82-1|Benzene, 1,2,4-trichloro-|unsym-Trichlorobenzene|Hostetex L-pec|Trojchlorobenzene
CHLOROACETALDEHYDE|2-chloroacetaldehyde|107-20-0|Chloroethanal|2-Chloroethanal|Acetaldehyde, chloro-|Chloroaldehyde|Monochloroacetaldehyde|2-Chloro-1-ethanal

CodePudding user response:

In PQ, this will pull out the contents of any item that does not contain a letter in cmpdsynonym, which I think is basically what you are looking for

let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"cmpdsynonym", type text}}),
#"Added Custom1" = Table.AddColumn(#"Changed Type", "Custom.1", each 
Text.Split([cmpdsynonym],"|"){ 
List.PositionOf(
List.Transform(
Text.Split([cmpdsynonym],"|")
, each List.ContainsAny({_}, {"a".."z","A".."Z"}))
,true)
 -1   
}
)
in #"Added Custom1"

enter image description here

CodePudding user response:

Here's one way of doing it in PQ, using fnRegexExtr to return the CAS; and a simple Text.Split to return the chemical compound name:

let

//Read in data and set data type as text
    Source = Excel.CurrentWorkbook(){[Name="Compounds"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),

//Transform to desired output
    Result = Table.FromColumns(
        {List.Transform(#"Changed Type"[Column1], each Text.Split(_,"|")){0}}
        & {List.Transform(#"Changed Type"[Column1],each fnRegexExtr(_, "\\b\\d{1,7}-\\d{2}-\\d"))},
        type table[Compound=text, CAS=text]
        )
in
    Result

Original
enter image description here

Results
enter image description here

  • Related