Replace strings with a certain format in bash

I have a file like this. It is a 7-column file with a single space as the separator (sep=" ").

However, the 4th column is a string of words that itself contains spaces. The last 3 columns are numbers.

test_find.txt

A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007
G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015

I want to replace the spaces with underscores (e.g. replace "Sterile alpha motif domain" with "Sterile_alpha_motif_domain"). First, find the pattern that starts with letters and ends with "|", treat it as one string and replace all of its spaces with "_". Then move to the next line and find the next pattern. (Is there an easier way to do it?)

I was able to use sed -i -e 's/Sterile alpha motif domain/Sterile_alpha_motif_domain/g' test_find.txt, which only handles the first row, but I cannot generalize it.

I tried to find all patterns using sed -n 's/^[^[a-z]]*{$.*$\\[^\|]*$/\1/p' test_find.txt but it doesn't work.

Can anyone help me?

I want output like this:

A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015

Thank you!!!!

CodePudding user response：

Assuming there is always a punctuation character right after the 4th column, before the final numeric column, you can try this sed:

$ sed -E 's~([[:alpha:]/]+) ~\1_~g;s/_([[:punct:]])/ \1/g' input_file
0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
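
The output above starts at the third column, so the command was presumably run on lines without the first two fields. On the full seven-column lines from the question, the first substitution would also turn "A UTR3" into "A_UTR3", because "A" is alphabetic and followed by a space. A minimal sketch that confines the replacement to the 4th column is given below. It assumes exactly one space between fields, three space-free fields before the free-text column and three after it. The function name fix_col4 is hypothetical and not part of the answer above.

```shell
# fix_col4 is a hypothetical helper name used only for this sketch.
# Assumption: single-space separators, 3 space-free fields before the
# free-text column and 3 after it.
fix_col4() {
    # Each pass joins the first remaining space inside column 4;
    # 'ta' jumps back to label 'a' until no substitution succeeds.
    sed -E -e ':a' \
           -e 's/^(([^ ]+ ){3}[^ ]+) (.* ([^ ]+ ){2}[^ ]+)$/\1_\3/' \
           -e 'ta' "$@"
}

# Sample lines copied from the question.
printf '%s\n' \
    'A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007' \
    'G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015' |
    fix_col4
```

The loop stops once column 4 contains no more spaces, since the pattern then can no longer find three trailing fields after the space it would replace.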

CodePudding user response：

We need two-step processing: first extract the 4th column, which may contain spaces; then replace the spaces in the 4th column with underscores.

With GNU awk:

gawk '{
    if (match($0, /^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$/, a)) {
        gsub(/ /, "_", a[3])
        print a[1] a[3] a[4]
    }
}' test_find.txt

Output:

A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015

The regex ^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$ matches a whole line and captures each submatch.
The 3rd argument a (a GNU awk extension) names an array that receives the capture groups. a[1] holds the 1st-3rd columns, a[3] holds the 4th column, and a[4] holds the 5th-7th columns.
The gsub function replaces the spaces in a[3] with underscores.
Then the columns are concatenated and printed.
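
If GNU awk is not available, the same two steps can be sketched in POSIX awk by relying on default field splitting instead of capture groups: fields 4 through NF-3 are glued together with underscores, and the first three and last three fields are printed unchanged. This is only a sketch, assuming every line has at least 7 fields separated by single spaces. The function name join_col4 is hypothetical.

```shell
# join_col4 is a hypothetical helper name used only for this sketch.
# Assumption: at least 7 space-separated fields per line; the first 3 and
# the last 3 fields never contain spaces.
join_col4() {
    awk '{
        out = $1 " " $2 " " $3 " " $4
        # Append the remaining words of column 4, joined with underscores.
        for (i = 5; i <= NF - 3; i++)
            out = out "_" $i
        print out, $(NF-2), $(NF-1), $NF
    }' "$@"
}

# Sample line copied from the question.
printf '%s\n' \
    'G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015' |
    join_col4
```

Unlike the gawk version, this prints every line, including lines whose last three columns do not match [0-9.]+.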

CodePudding user response：

Without making any assumptions about the content of each field, you can 'brute force' the expected result by counting the number of characters in each field (plus the number of field separators) at the beginning and at the end of the line, and use this to manipulate the '4th column', e.g.

awk '{start=length($1)+length($2)+length($3)+4; end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6; text=substr($0, start, end); gsub(" ", "_", text); print $1, $2, $3, text, $(NF-2), $(NF-1), $NF}' test.txt

'Neat' version:

awk '{
    start=length($1)+length($2)+length($3)+4
    end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6
    text=substr($0, start, end)
    gsub(" ", "_", text)
    print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015

Breakdown:

awk '{
# Where column 4 begins (length of the first 3 fields + 3 field separators + 1, because substr positions are 1-based; hence "4")
start=length($1)+length($2)+length($3)+4;
# How many characters are there in column 4 (total - (first 3 fields + last 3 fields + total field separators (6)))
end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6;
# Use the substr function to extract column 4 (despite its name, "end" is a length)
text=substr($0, start, end);
# Replace spaces with underscores in column 4
gsub(" ", "_", text);
# Print everything
print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
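
To make the arithmetic in the breakdown concrete, the sketch below prints only the two numbers the formulas produce, without modifying the line. It assumes the same single-space layout as the question. The function name col4_span and the variable names head, tail and len are hypothetical.

```shell
# col4_span is a hypothetical helper name used only for this sketch.
# It prints the start position and the length of column 4 for each line.
col4_span() {
    awk '{
        head = length($1) + length($2) + length($3)          # first 3 fields
        tail = length($NF) + length($(NF-1)) + length($(NF-2)) # last 3 fields
        start = head + 4                  # 3 separators + 1 (1-based index)
        len = length($0) - head - tail - 6 # 6 separators in total
        print start, len
    }' "$@"
}

# First sample line copied from the question.
printf '%s\n' \
    'A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007' |
    col4_span
```

For this line the first three fields take 1 + 4 + 5 = 10 characters, so column 4 starts at position 14, and the line is 104 characters long, which leaves 104 - 10 - 8 - 6 = 80 characters for column 4. The sketch therefore prints 14 80.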