Using Unix (awk, sed, bash?) to truncate items in a column by the 4th underscore?

I have a series of files that look like this. The second and third columns are duplicates of each other, and each file has thousands of lines.

AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2_i1.p1 1.36e-115
AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1_i2.p1 9.97e-113
AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1_i1.p1 6.26e-66

I want to take the 3rd column and truncate it so that everything in the string after and including _i is deleted, like so:

AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2 1.36e-115
AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1 9.97e-113
AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1 6.26e-66

The numbers after each letter combination (DN, c, g, i, p) could be anything and could also be any length, so I can't just truncate to a certain length.

I've tried sed -i 's/_i.*//' file.txt, but this deleted everything from the first _i to the end of each line, not just in the column of interest.

Thanks so much!

CodePudding user response：

Using awk, you can remove everything from the first occurrence of _i to the end of the third field:

awk '{ sub(/_i.*/, "", $3) } 1' file

Output

AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2 1.36e-115
AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1 9.97e-113
AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1 6.26e-66

CodePudding user response：

Using sed

$ sed -i.bak 's/\(g[0-9]*\)_[^ ]*/\1/2' input_file
AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2 1.36e-115
AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1 9.97e-113
AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1 6.26e-66

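The trailing 2 flag applies the substitution to the second occurrence of the match on each line, which here is the one in the third column. Note that with -i.bak the result is written back to input_file (the original is kept as input_file.bak) rather than printed.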
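
To see the numeric flag in isolation, here is a minimal sketch on a made-up one-line input (the sample string and the function name are hypothetical, not from the question's data). The same pattern occurs twice on the line, and only the second match is edited:

```shell
# Hypothetical one-line input in which the pattern matches twice.
demo_second_match() {
  printf '%s\n' 'x_g1_i1 x_g1_i1 end' |
    sed 's/\(g[0-9]*\)_[^ ]*/\1/2'   # the 2 flag: edit only the 2nd match
}

demo_second_match
```

The first x_g1_i1 is left untouched and the second becomes x_g1.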

CodePudding user response：

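With GNU awk, remove the last six characters of the third column. Note that this assumes the suffix (for example _i1.p1) is always exactly six characters long, so it will not cut at the right place if the numbers have more digits: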

awk '{ gsub(/......$/, "", $3); print }' file

awk '{ gsub(/.{6}$/, "", $3) }1' file
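
As a quick check of how a fixed-width cut behaves when the isoform number has more digits, the sketch below runs the same idea on a hypothetical line whose suffix is _i12.p1 (seven characters). The input line and the function name are made up for illustration:

```shell
# Hypothetical input: the suffix "_i12.p1" is 7 characters, not 6.
demo_fixed_width() {
  printf '%s\n' 'AT1G15820.1 TRINITY_DN1_c0_g1_i12.p1 TRINITY_DN1_c0_g1_i12.p1 1e-5' |
    awk '{ sub(/......$/, "", $3) } 1'   # drop the last six characters of field 3
}

demo_fixed_width
```

Here the third column comes out as TRINITY_DN1_c0_g1_ with a stray trailing underscore, so a pattern-based cut is the safer choice when the lengths vary.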

CodePudding user response：

This will do it using GNU sed:

sed 's/\(.*_i.*\)_i.*\(\s.*\)/\1\2/' your_file > output_file

\( and \) delimit capture groups, which remember whatever they match.
\(.*_i.*\) remembers everything up to (but not including) the second _i; the _i.* that follows it matches the part to be dropped.
\(\s.*\) remembers everything from the space after the second _i to the end of the line.
/\1\2/ replaces the line with the first and second capture groups, i.e. it drops everything from the second _i up to the space before the last column.

Outputs

AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2 1.36e-115
AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1 9.97e-113
AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1 6.26e-66
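
\s is a GNU extension, so a non-GNU sed may not accept it. Below is a sketch of a variant that uses POSIX character classes instead, assuming the columns are separated by spaces or tabs and that _i appears exactly once in each of columns two and three (the function name trim_third_column is made up for illustration):

```shell
# Same idea as above, with [[:space:]] in place of the GNU-only \s.
trim_third_column() {
  sed 's/\(.*_i.*\)_i[^[:space:]]*\([[:space:]].*\)/\1\2/' "$@"
}

# Example run on one line from the question.
printf '%s\n' 'AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2_i1.p1 1.36e-115' |
  trim_third_column
```

Called with file names as arguments (trim_third_column your_file > output_file) it reads those files instead of standard input.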

CodePudding user response：

awk '{sub(/_[^_]+$/,"",$3)}1' file

AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2 1.36e-115
AT1G15820.1 TRINITY_DN96909_c1_g1_i2.p1 TRINITY_DN96909_c1_g1 9.97e-113
AT1G15820.1 TRINITY_DN96909_c1_g1_i1.p1 TRINITY_DN96909_c1_g1 6.26e-66

In the 3rd field, this deletes everything from the last underscore onward (the underscore included).

CodePudding user response：

A Perl one-liner:

perl -lane '$F[2] = join "_", (split /_/, $F[2])[0..3]; print "@F"' file

That splits the 3rd field on underscore, takes the first 4 components and joins them with underscore.
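
The same split-and-rejoin idea can be sketched in awk for anyone who would rather not use Perl. This is an illustration under the assumption that columns are whitespace-separated; the function name keep_four_parts is hypothetical:

```shell
# Keep only the first four underscore-separated pieces of field 3.
keep_four_parts() {
  awk '{
    n = split($3, parts, "_")            # break field 3 on underscores
    out = parts[1]
    for (i = 2; i <= n && i <= 4; i++)   # rejoin at most the first four pieces
      out = out "_" parts[i]
    $3 = out
    print
  }' "$@"
}

# Example run on one line from the question.
printf '%s\n' 'AT1G15820.1 TRINITY_DN96909_c1_g2_i1.p1 TRINITY_DN96909_c1_g2_i1.p1 1.36e-115' |
  keep_four_parts
```

Because it counts underscores rather than looking for _i, it matches the "4th underscore" wording of the question and does not depend on the length of any of the numbers.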