I am working with text files that contain, somewhere, a line like this:
ligand_types A C Cl NA OA N SA HD # ligand atom types
I need to check the atom types on that line (A C Cl NA OA N SA HD) and, if "SA" and/or "HD" is absent, append the missing one(s) to the end of the list.
Sample input:
ligand_types A C Cl NA OA N HD # ligand atom types
ligand_types A C NA OA N SA # ligand atom types
ligand_types A C Cl NA OA N # ligand atom types
Expected output:
ligand_types A C Cl NA OA N HD SA # ligand atom types
ligand_types A C NA OA N SA HD # ligand atom types
ligand_types A C Cl NA OA N HD SA # ligand atom types
Could you suggest a sed or awk solution for this task?
For example, with sed it might be something like:
sed -i 's/old-string/new-string-with-additional-patterns/g' my_file
CodePudding user response:
Update: added support for end-of-line comments.
Here's an awk idea:
awk -v types="SA HD" '
BEGIN {
    typesCount = split(types,typesArr)
}
/^ligand_types / {
    # split off an end-of-line comment, if any
    if (i = index($0,"#")) {
        comment = substr($0,i)
        $0 = substr($0,1,i-1)
    } else
        comment = ""
    # start from all wanted types, then drop those already on the line
    for (i = 1; i <= typesCount; i++)
        typesHash[typesArr[i]]
    for (i = 2; i <= NF; i++)
        delete typesHash[$i]
    # append whatever is still missing, then the comment
    for (t in typesHash)
        $(NF+1) = t
    if (comment)
        $(NF+1) = comment
}
1
' my_file > my_file.new
With your examples (each input line is followed by the corresponding output line):
ligand_types A C Cl NA OA N SA HD # ligand atom types
ligand_types A C Cl NA OA N SA HD # ligand atom types
ligand_types A C Cl NA OA N HD # ligand atom types
ligand_types A C Cl NA OA N HD SA # ligand atom types
ligand_types A C NA OA N SA # ligand atom types
ligand_types A C NA OA N SA HD # ligand atom types
ligand_types A C Cl NA OA N # ligand atom types
ligand_types A C Cl NA OA N HD SA # ligand atom types
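One detail worth noting: awk does not specify the iteration order of `for (t in array)`, so when both types are missing they may be appended as "HD SA" or "SA HD" depending on the awk implementation. The following is a minimal sketch of a variant that appends missing types in the order listed in `types`; the function name `add_types` and the array names `want` and `seen` are hypothetical, chosen only for this illustration.

```shell
# add_types is a hypothetical helper name used only for this sketch.
add_types() {
    awk -v types="SA HD" '
    BEGIN { n = split(types, want, " ") }
    /^ligand_types / {
        # Separate an optional trailing comment from the type list.
        comment = ""
        if (i = index($0, "#")) {
            comment = substr($0, i)
            $0 = substr($0, 1, i - 1)
        }
        # Record which types are already present on this line.
        split("", seen)
        for (i = 2; i <= NF; i++) seen[$i] = 1
        # Append missing types in the order given in the types variable.
        for (i = 1; i <= n; i++)
            if (!(want[i] in seen)) $(NF + 1) = want[i]
        # Put the comment back as the last field.
        if (comment != "") $(NF + 1) = comment
    }
    1
    ' "$@"
}

printf '%s\n' \
    'ligand_types A C Cl NA OA N HD # ligand atom types' \
    'ligand_types A C NA OA N SA # ligand atom types' \
    'ligand_types A C Cl NA OA N # ligand atom types' | add_types
```

With `types="SA HD"` the third sample line should come out ending in "SA HD"; passing `types="HD SA"` instead would give the "HD SA" order shown in the question.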
CodePudding user response:
You could just remove any SA and/or HD that exist in the input and then always add them to the output:
$ cat tst.awk
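{
    # split the line into the part before any "#" comment and the comment itself
    if ( match($0,/[[:space:]]*#/) ) {
        head = substr($0,1,RSTART-1) " "
        tail = substr($0,RSTART)
    }
    else {
        head = $0 " "
        tail = ""
    }
    # drop any existing SA/HD (including adjacent runs), then the trailing blank
    gsub(/( (SA|HD))+ /," ",head)
    sub(/ +$/,"",head)
    print head, "SA HD" tail
}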
$ awk -f tst.awk file
ligand_types A C Cl NA OA N SA HD # ligand atom types
ligand_types A C NA OA N SA HD # ligand atom types
ligand_types A C Cl NA OA N SA HD # ligand atom types
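Since the question says the `ligand_types` line sits somewhere inside a larger file, the same "strip, then always append" idea can be limited to that line. Below is a minimal sketch that filters the fields instead of using a regex substitution and passes every other line through unchanged; the function name `normalize_types` and the variable names are hypothetical, and it assumes the type list is separated by blanks or tabs.

```shell
# normalize_types is a hypothetical helper name used only for this sketch.
normalize_types() {
    awk '
    /^ligand_types / {
        # Split the line into the type list and an optional "#" comment.
        head = $0
        tail = ""
        if (match($0, /[ \t]*#/)) {
            head = substr($0, 1, RSTART - 1)
            tail = substr($0, RSTART)
        }
        # Rebuild the list without SA and HD, then append both.
        n = split(head, f, " ")
        out = f[1]
        for (i = 2; i <= n; i++)
            if (f[i] != "SA" && f[i] != "HD") out = out " " f[i]
        print out " SA HD" tail
        next
    }
    { print }   # all other lines are left untouched
    ' "$@"
}

printf '%s\n' \
    'some unrelated SA line' \
    'ligand_types A C Cl NA OA N HD # ligand atom types' | normalize_types
```

A consequence of this design is that SA and HD always end up last and in that order, even on lines that already contained them elsewhere in the list.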
CodePudding user response:
GNU sed solution. Let file.txt content be
ligand_types A C Cl NA OA N HD # ligand atom types
ligand_types A C NA OA N SA # ligand atom types
ligand_types A C Cl NA OA N # ligand atom types
then
sed -e '/SA/! s/\(.*\)\([A-Z]\)/\1\2 SA/' -e '/HD/! s/\(.*\)\([A-Z]\)/\1\2 HD/' file.txt
gives output
ligand_types A C Cl NA OA N HD SA # ligand atom types
ligand_types A C NA OA N SA HD # ligand atom types
ligand_types A C Cl NA OA N SA HD # ligand atom types
Explanation: I register two expressions. The first applies to lines which do not contain SA; in that case I add SA after the last uppercase letter. .* matches greedily, so [A-Z] is then the last uppercase letter; as these are not to be removed, I put them back in the replacement using \1 and \2 respectively. The second expression does the same thing for HD. Note that it adds SA before HD if both are missing, as stipulated by
need to check the patterns in the string: A C Cl NA OA N SA HD
Disclaimer: this solution assumes that the comment is always lowercase
(tested in GNU sed 4.7)
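If the comment may contain uppercase letters, one option is to anchor the insertion at the start of the comment rather than at the last uppercase letter. The following is a rough sketch of that idea, not the answer's original command: it uses `sed -E` (extended regular expressions), only touches lines starting with `ligand_types`, and only looks for SA/HD in the part of the line before the first `#`. The function name `add_types_sed` is hypothetical.

```shell
# add_types_sed is a hypothetical helper name used only for this sketch.
add_types_sed() {
    sed -E \
        -e '/^ligand_types[[:space:]]/!b' \
        -e '/^[^#]*[[:space:]]SA([[:space:]]|#|$)/!s/^([^#]*[^#[:space:]])([[:space:]]*(#.*)?)$/\1 SA\2/' \
        -e '/^[^#]*[[:space:]]HD([[:space:]]|#|$)/!s/^([^#]*[^#[:space:]])([[:space:]]*(#.*)?)$/\1 HD\2/' \
        "$@"
}

printf '%s\n' \
    'ligand_types A C Cl NA OA N HD # Ligand Atom Types' \
    'ligand_types A C NA OA N SA # ligand atom types' \
    'ligand_types A C Cl NA OA N' | add_types_sed
```

The first expression skips every line that is not a `ligand_types` line. In the other two, `\1` captures the type list up to its last non-blank character and `\2` captures any trailing blanks plus the optional comment, so the missing type is inserted between them.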
CodePudding user response:
With your shown samples, please try the following awk code, written and tested in GNU awk. A simple explanation: RS is set to [[:space:]]+# and three conditions are then checked. If the current record ends with HD, SA is added to it; if it ends with SA, HD is added to it; and if it ends with neither HD nor SA, both are added. The output is then printed in the format shown by the OP.
awk -v RS='[[:space:]]+#' '
$0~/HD$/ && !found2{ sub(/HD$/,"& SA");found1=1 }
$0~/SA$/ && !found1{ sub(/SA$/,"& HD");found2=1 }
RT && $0!~/HD$/ && $0!~/SA$/{ $0=$0" HD SA" }
{ ORS=RT; found1=found2=""}
1
' Input_file