replace strings with certain format in bash

Time:07-12

I have a file like this. It is a 7-column text file whose columns are separated by a single space (sep=" ").

However, the 4th column is a string of several words, so it also contains spaces. The last 3 columns are numbers.

test_find.txt

A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007
G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015

I want to replace the spaces in that column with underscores (e.g. turn "Sterile alpha motif domain" into "Sterile_alpha_motif_domain"). My idea: first find the pattern that starts with letters and ends with "|", treat it as one string, and replace all of its spaces with "_"; then move to the next line and find the next pattern. (Is there an easier way to do it?)

I was able to use sed -i -e 's/Sterile alpha motif domain/Sterile_alpha_motif_domain/g' test_find.txt to fix the first row only, but I cannot generalize it.

I tried to find all patterns using sed -n 's/^[^[a-z]]*{\(.*\)\\[^\|]*$/\1/p' test_find.txt but it doesn't work.

Can anyone help me?

I want output like this:

A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015

Thank you!!!!

CodePudding user response:

Assuming there is a punctuation character right after the text column, before the final numeric column, you can try this sed:

$ sed -E 's~([[:alpha:]/]+) ~\1_~g;s/_([[:punct:]])/ \1/g' input_file
0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
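
A pattern built on character classes cannot tell the leading alphabetic columns (A, UTR3, intergenic) apart from the words of the text column. As a minimal sketch, assuming exactly three single-space-separated columns before the text and three after it, the replacement can be anchored on column positions instead and repeated until no space is left inside the 4th column. The function name underscore_col4 is hypothetical and only used for illustration.

```shell
# Hypothetical helper: replace spaces with "_" inside column 4 only.
# Assumption: 3 columns before the text, 3 columns after it, single-space separators.
underscore_col4() {
    # Each pass replaces the last remaining space inside column 4;
    # "t again" loops back for as long as a substitution was made.
    sed -E \
        -e ':again' \
        -e 's/^(([^ ]+ ){3}.*) (.*( [^ ]+){3})$/\1_\3/' \
        -e 't again'
}

# Reads from standard input, e.g. underscore_col4 < test_find.txt
printf '%s\n' 'G intergenic 0.673 BTB/POZ domain|BTB/POZ domain . . 0.0015' | underscore_col4
```

The loop stops by itself: once column 4 contains no space, the line has too few spaces left for the pattern to match.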

CodePudding user response:

We need two-step processing: first extract the 4th column, which may contain spaces; then replace the spaces in that column with underscores.

With GNU awk:

gawk '{
    if (match($0, /^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$/, a)) {
        gsub(/ /, "_", a[3])
        print a[1] a[3] a[4]
    }
}' test_find.txt

Output:

A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
  • The regex ^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$ matches a whole line and captures its parts as submatches.
  • The 3rd argument of match (a GNU awk extension), a, is the name of an array that receives the capture groups. a[1] holds the 1st-3rd columns, a[3] holds the 4th column, and a[4] holds the 5th-7th columns.
  • The gsub function replaces each space in a[3] with an underscore.
  • Then the parts are concatenated and printed.
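
Where the three-argument form of match is not available, the same two steps can be sketched in portable awk by treating fields 4 through NF-3 as the text column and joining them with underscores. This is only a sketch under the assumptions that separators are single spaces and that there are exactly three columns on either side of the text; the function name join_col4 is hypothetical.

```shell
# Hypothetical helper: join the words of column 4 with "_" using plain awk.
# Assumption: fields 1-3 and the last 3 fields are single words.
join_col4() {
    awk 'NF >= 7 {
        out = $1 " " $2 " " $3 " " $4      # first three columns plus the first word of column 4
        for (i = 5; i <= NF - 3; i++)      # remaining words of column 4
            out = out "_" $i
        print out, $(NF-2), $(NF-1), $NF   # append the last three columns
    }'
}

# Reads from standard input, e.g. join_col4 < test_find.txt
printf '%s\n' 'A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain . . 0.0007' | join_col4
```

Lines with fewer than 7 fields are skipped by the NF >= 7 guard, much like the gawk version skips lines that do not match.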

CodePudding user response:

Without making any assumptions about the content of each field, you can 'brute force' the expected result by counting the number of characters in each field (plus the number of field separators) at the beginning and at the end of the line, and use these counts to manipulate the '4th column', e.g.

awk '{start=length($1)+length($2)+length($3)+4; end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6; text=substr($0, start, end); gsub(" ", "_", text); print $1, $2, $3, text, $(NF-2), $(NF-1), $NF}' test.txt

'Neat' version:

awk '{
    start=length($1)+length($2)+length($3)+4
    end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6
    text=substr($0, start, end)
    gsub(" ", "_", text)
    print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015

Breakdown:

awk '{
# Position where column 4 begins: length of the first 3 fields + 3 field separators + 1 (hence "4")
start=length($1)+length($2)+length($3)+4;
# How many characters are there in column 4 (total - (first 3 fields + last 3 fields + total field separators (6)))
end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6;
# Use the substr function to define column 4
text=substr($0, start, end);
# Substitute spaces for underscores in column 4
gsub(" ", "_", text);
# Print everything
print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
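
To make the arithmetic concrete, the sketch below only prints the computed start position and length of column 4 instead of rewriting the line. The names col4_span, head, tail and len are illustrative and not part of the answer above; len corresponds to what the answer calls end.

```shell
# Hypothetical helper: report where column 4 starts and how long it is.
# Assumption: single-space separators, 3 columns before and 3 after the text.
col4_span() {
    awk '{
        head = length($1) + length($2) + length($3)          # characters in the first 3 fields
        tail = length($(NF-2)) + length($(NF-1)) + length($NF)  # characters in the last 3 fields
        start = head + 4                    # 3 separators, then the next character
        len = length($0) - head - tail - 6  # 6 separators surround column 4
        print "start=" start, "len=" len
    }'
}

# Reads from standard input, e.g. col4_span < test.txt
printf '%s\n' 'A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007' | col4_span
```

Passing these two numbers to substr($0, start, len) yields exactly the text column, spaces included.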