sed/awk: adding missing patterns to a string

I am working with text files that contain, somewhere, a line like this:

ligand_types A C Cl NA OA N SA HD    # ligand atom types

I need to check the atom types in that line (A C Cl NA OA N SA HD) and, if "SA" and/or "HD" are absent, append the missing ones to the end of the list.

Sample input:

ligand_types A C Cl NA OA N HD       # ligand atom types
ligand_types A C NA OA N SA          # ligand atom types
ligand_types A C Cl NA OA N          # ligand atom types

Expected output:

ligand_types A C Cl NA OA N HD SA    # ligand atom types
ligand_types A C NA OA N SA HD       # ligand atom types
ligand_types A C Cl NA OA N HD SA    # ligand atom types

Could you suggest a sed or awk solution for this task?

For example, with sed it might look like:

sed -i 's/old-string/new-string-with-additional-patterns/g' my_file

CodePudding user response：

Update: adding support for end-of-line comments

Here's an awk idea:

awk -v types="SA HD" '
    BEGIN {
        typesCount = split(types,typesArr)
    }
    /^ligand_types / {

        if (i = index($0,"#")) {
            comment = substr($0,i)
            $0 = substr($0,1,i-1)
        } else
            comment = ""

        for (i = 1; i <= typesCount; i++)
            typesHash[typesArr[i]]

        for (i = 2; i <= NF; i++)
            delete typesHash[$i]

        for (t in typesHash)
            $(NF+1) = t

        if (comment)
            $(NF+1) = comment
    }
    1
' my_file > my_file.new

With your examples (each input line is followed by its output):

ligand_types A C Cl NA OA N SA HD    # ligand atom types
ligand_types A C Cl NA OA N SA HD # ligand atom types

ligand_types A C Cl NA OA N HD       # ligand atom types
ligand_types A C Cl NA OA N HD SA # ligand atom types

ligand_types A C NA OA N SA          # ligand atom types
ligand_types A C NA OA N SA HD # ligand atom types

ligand_types A C Cl NA OA N          # ligand atom types
ligand_types A C Cl NA OA N HD SA # ligand atom types
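Since the order in which awk walks "for (t in typesHash)" is not specified, the order of the appended types can differ between awk implementations. The following is only a sketch of one way to make it deterministic, assuming the types should be appended in the order given in the "types" variable ("HD SA" here, to match the expected output in the question). The function name add_types and the array names want and seen are made up for illustration.

```shell
# Hypothetical helper: append any of the listed types that are missing
# from a "ligand_types" line, in the order they appear in `types`.
add_types() {
    awk -v types="HD SA" '
        BEGIN { n = split(types, want, " ") }
        /^ligand_types / {
            comment = ""
            if (p = index($0, "#")) {          # split off an end-of-line comment
                comment = substr($0, p)
                $0 = substr($0, 1, p - 1)
            }
            split("", seen)                    # portable way to empty an array
            for (i = 2; i <= NF; i++)          # record the types already present
                seen[$i] = 1
            for (i = 1; i <= n; i++)           # append missing types in list order
                if (!(want[i] in seen))
                    $(NF + 1) = want[i]
            if (comment != "")
                $(NF + 1) = comment            # re-attach the comment
        }
        1
    ' "$@"
}
```

It can be used as "add_types my_file > my_file.new && mv my_file.new my_file". As in the script above, the spacing before the comment is collapsed to a single space.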

CodePudding user response：

You could just remove any SA and/or HD that exist in the input and then always add them to the output:

$ cat tst.awk
{
    if ( match($0,/[[:space:]]*#/) ) {
        head = substr($0,1,RSTART-1) " "
        tail = substr($0,RSTART)
    }
    else {
        head = $0 " "
        tail = ""
    }
    while ( sub(/ (SA|HD) /," ",head) ) ;   # loop so that an adjacent "SA HD" pair is fully removed
    sub(/ +$/,"",head)
    print head, "SA HD" tail
}

$ awk -f tst.awk file
ligand_types A C Cl NA OA N SA HD       # ligand atom types
ligand_types A C NA OA N SA HD          # ligand atom types
ligand_types A C Cl NA OA N SA HD          # ligand atom types
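The same remove-then-append idea can also be written as a loop over fields, which avoids depending on the spaces around each token. This is only a sketch under a few assumptions: it touches only lines that start with "ligand_types ", it always emits the types in the order "SA HD", and the function name normalize_types is made up for illustration.

```shell
# Hypothetical helper: drop any existing SA/HD fields, then always append "SA HD",
# keeping the original whitespace in front of an end-of-line comment.
normalize_types() {
    awk '
        /^ligand_types / {
            tail = ""
            if (match($0, /[ \t]*#/)) {        # comment plus the blanks before it
                tail = substr($0, RSTART)
                $0 = substr($0, 1, RSTART - 1)
            }
            out = $1
            for (i = 2; i <= NF; i++)          # copy every type except SA and HD
                if ($i != "SA" && $i != "HD")
                    out = out " " $i
            $0 = out " SA HD" tail
        }
        1
    ' "$@"
}
```

For example, "normalize_types file" prints the rewritten lines to standard output and leaves all other lines unchanged.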

CodePudding user response：

A GNU sed solution. Let the content of file.txt be

ligand_types A C Cl NA OA N HD       # ligand atom types
ligand_types A C NA OA N SA          # ligand atom types
ligand_types A C Cl NA OA N          # ligand atom types

then

sed -e '/SA/! s/\(.*\)\([A-Z]\)/\1\2 SA/' -e '/HD/! s/\(.*\)\([A-Z]\)/\1\2 HD/' file.txt

gives output

ligand_types A C Cl NA OA N HD SA       # ligand atom types
ligand_types A C NA OA N SA HD          # ligand atom types
ligand_types A C Cl NA OA N SA HD          # ligand atom types

Explanation: I register two expressions. The first applies to lines that do not contain SA; on those lines it appends SA after the last uppercase letter. Because .* is greedy, [A-Z] matches the last uppercase letter on the line, and since neither part should be removed, I put them back in the replacement using \1 and \2 respectively. The second expression does the same thing for HD. Note that it adds SA before HD if both are missing, as stipulated by

need to check the patterns in the string: A C Cl NA OA N SA HD

Disclaimer: this solution assumes that the comment is always lowercase.

(tested in GNU sed 4.7)
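If the file also contains other lines, the two expressions can be restricted to the ligand_types line. The following is a sketch rather than a tested drop-in: it assumes the line of interest starts with "ligand_types ", it keeps the lowercase-comment assumption from the disclaimer above, and the function name add_types_sed is made up for illustration.

```shell
# Hypothetical wrapper: same two substitutions, but every line that does not
# start with "ligand_types " is skipped via an unlabeled branch (b).
add_types_sed() {
    sed -e '/^ligand_types /!b' \
        -e '/SA/! s/\(.*\)\([A-Z]\)/\1\2 SA/' \
        -e '/HD/! s/\(.*\)\([A-Z]\)/\1\2 HD/' "$@"
}
```

It is called like sed itself, for example "add_types_sed file.txt", or "add_types_sed -i file.txt" to edit the file in place with GNU sed.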

CodePudding user response：

With your shown samples, please try the following awk code, written and tested in GNU awk. In short: RS is set to [[:space:]]+# so that each record ends just before a comment, and then three conditions are checked. If the record ends with HD, SA is added after it; if it ends with SA, HD is added after it; and if it ends with neither HD nor SA, both are added. The output is then printed in the format shown by the OP.

awk -v RS='[[:space:]]+#' '
$0~/HD$/ && !found2{ sub(/HD$/,"& SA");found1=1 }
$0~/SA$/ && !found1{ sub(/SA$/,"& HD");found2=1 }
RT && $0!~/HD$/ && $0!~/SA$/{ $0=$0" HD SA" }
{ ORS=RT; found1=found2=""}
1
'   Input_file