I have a file like this. It is a 7-column file with a single space as the separator (sep=" ").
However, the 4th column is a string of several words, so it also contains spaces. The last 3 columns are numbers.
test_find.txt
A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007
G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015
I want to replace the spaces with underscores (e.g. replace "Sterile alpha motif domain" with "Sterile_alpha_motif_domain"). My idea: first find the pattern that starts with letters and ends with "|", treat it as one string, and replace all of its spaces with "_"; then move to the next line and find the next pattern. (Is there an easier way to do it?)
I was able to use
sed -i -e 's/Sterile alpha motif domain/Sterile_alpha_motif_domain/g' test_find.txt
to fix only the first row, but I cannot generalize it.
I tried to find all patterns using
sed -n 's/^[^[a-z]]*{\(.*\)\\[^\|]*$/\1/p' test_find.txt
but it doesn't work.
Can anyone help me?
I want output like this:
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Thank you!!!!
CodePudding user response:
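Assuming a punctuation character always follows the 4th column (before the final numeric column), you can try this sed. The sample output shows only the last five columns; on the full 7-column input the first substitution would also join leading alphabetic fields (e.g. "A UTR3" becomes "A_UTR3").
$ sed -E 's~([[:alpha:]/]+) ~\1_~g;s/_([[:punct:]])/ \1/g' input_file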
0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
CodePudding user response:
We'll need two-step processing: first extract the 4th column, which may contain spaces; then replace the spaces in it with underscores.
With GNU awk:
gawk '{
    if (match($0, /^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$/, a)) {
        gsub(/ /, "_", a[3])
        print a[1] a[3] a[4]
    }
}' test_find.txt
Output:
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
- The regex ^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$ matches a whole line, capturing each submatch.
- The 3rd argument a (a GNU awk extension) names an array that receives the capture groups. a[1] holds the 1st-3rd columns, a[3] holds the 4th column, and a[4] holds the 5th-7th columns.
- The gsub function replaces the spaces in a[3] with underscores.
- Then the columns are concatenated and printed.
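For an awk without the GNU match() extension, a minimal sketch of the same idea follows. It assumes every line has exactly three space-free fields before the 4th column and three after it, all separated by single spaces. The function name join_col4 is hypothetical and used only for illustration.

```shell
# Hypothetical helper: glue fields 4..NF-3 back together with "_" (POSIX awk).
# Assumes 3 leading and 3 trailing fields that contain no spaces.
join_col4() {
  awk '{
    out = $1 " " $2 " " $3 " " $4
    # Every extra field up to NF-3 belongs to the original 4th column.
    for (i = 5; i <= NF - 3; i++) out = out "_" $i
    print out, $(NF-2), $(NF-1), $NF
  }' "$@"
}

# Usage: read from a file or from standard input.
printf '%s\n' 'G intergenic 0.673 BTB/POZ domain|BTB/POZ domain . . 0.0015' | join_col4
```

Because awk splits on runs of whitespace by default, this variant would collapse repeated spaces inside the 4th column into a single underscore, which may or may not be what you want.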
CodePudding user response:
Without making any assumptions about the content of each field, you can 'brute force' the expected result by counting the number of characters in each field (plus the number of field separators) at the beginning and at the end of the line, and use these counts to extract and manipulate the '4th column', e.g.
awk '{start=length($1)+length($2)+length($3)+4; end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6; text=substr($0, start, end); gsub(" ", "_", text); print $1, $2, $3, text, $(NF-2), $(NF-1), $NF}' test.txt
'Neat' version:
awk '{
    start=length($1)+length($2)+length($3)+4
    end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6
    text=substr($0, start, end)
    gsub(" ", "_", text)
    print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Breakdown:
awk '{
    # Position where column 4 begins: lengths of the first 3 fields + 3 field separators + 1 (hence "4")
    start=length($1)+length($2)+length($3)+4;
    # How many characters are in column 4: total - (first 3 fields + last 3 fields + 6 field separators)
    end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6;
    # Use the substr function to extract column 4 ("end" is used as a length here)
    text=substr($0, start, end);
    # Replace spaces with underscores in column 4
    gsub(" ", "_", text);
    # Print everything
    print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
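To check the arithmetic on a single line, the sketch below prints the computed start position and length alongside the extracted text. The function name col4_span and the shortened sample line are assumptions made for illustration; the formulas are the ones from the breakdown, still assuming single-space separators.

```shell
# Hypothetical helper: show where column 4 starts and how long it is.
col4_span() {
  awk '{
    head = length($1) + length($2) + length($3)            # first 3 fields
    tail = length($NF) + length($(NF-1)) + length($(NF-2)) # last 3 fields
    start = head + 4                  # 3 separators, then the first char of column 4
    len = length($0) - head - tail - 6  # 6 separators surround the 7 columns
    print start, len, substr($0, start, len)
  }' "$@"
}

printf '%s\n' 'G intergenic 0.673 BTB/POZ domain . . 0.0015' | col4_span
# prints: 20 14 BTB/POZ domain
```

Here the first three fields take 1 + 10 + 5 = 16 characters, so column 4 starts at position 20, and the 44-character line minus 16, minus 8 for the last three fields, minus 6 separators leaves 14 characters.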