removing multiple instances of a string in a line with sed

Time:01-11

I have a large tab-delimited file in which I'd like to keep only a certain string (GO:#######) that appears a variable number of times on each line, along with the placeholder lines that contain only a period. When I use sed to replace all the non-GO strings, it removes the entire middle of the line. How do I prevent this?

SED command I'm using and other permutations

sed -r 's/\t`.*`\t//g' file1.txt > file2.txt

What I have

GO:1234567    `text1`moretext`    GO:5373845    `diff`text`     GO:5438534     `text`text
.
GO:3333333     `txt`text`    GO:5553535    `misc`text
.
.

What I'd like

GO:1234567    GO:5373845    GO:5438534
.
GO:3333333    GO:5553535
.
.

What I get

GO:1234567    GO:5438534     `text`text
.
GO:3333333    GO:5553535    `misc`text
.
.

CodePudding user response:

With GNU awk:

awk 'BEGIN{FPAT="GO:[0-9]+"; OFS="\t"} {$1=$1; print}' file

Output is tab delimited:

GO:1234567  GO:5373845  GO:5438534

GO:3333333  GO:5553535

From man awk:

FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

CodePudding user response:

sed -E 's/\t`[^\t]*//g'
  • \t - tab
  • ` - a literal backtick
  • [^\t]* - any non-tab character 0 or more times

Alternative:

sed -E 's/\t(`[^`]*){2}`?//g'
  • \t - tab
  • ( - start of group
    • ` - a literal backtick
    • [^`]* - any non-backticks 0 or more times
  • ) - end of group
  • {2} - repeat group twice
  • `? - an optional backtick (since the last column only has 2 instead of 3)

... and substitute with an empty string.

Output:

GO:1234567      GO:5373845      GO:5438534
.
GO:3333333      GO:5553535
.
.

Note: These examples assume that there is exactly one tab between columns. It's hard to see here.
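Since the tabs are hard to see in the post, here is a minimal sketch for checking them before running the substitution. It assumes GNU sed (for `\t` in the pattern) and uses a hypothetical two-line sample built with printf rather than the real file; the variable names are illustrative only.

```shell
# Hypothetical sample; printf writes real tab characters between columns.
sample=$(printf 'GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text\n.')

# sed's l command prints each line unambiguously: a tab shows up as \t
# and the end of the line as $.
visible=$(printf '%s\n' "$sample" | sed -n 'l')
printf '%s\n' "$visible"

# Once single tabs are confirmed, the first command above keeps only the
# GO columns and leaves the "." line untouched.
cleaned=$(printf '%s\n' "$sample" | sed -E 's/\t`[^\t]*//g')
printf '%s\n' "$cleaned"
```

If the `l` output shows runs of spaces instead of `\t`, the patterns above will not match and the separator in the regex has to be adjusted.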

CodePudding user response:

I would explicitly match non-backtick characters:

s/`[^`]*`[^`]*`//

Regex is greedy: `.*` matches anything, so the match runs from the first backtick up to the last backtick.
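The greedy behaviour can be reproduced on a single line. This is a sketch under a few assumptions: GNU sed, exactly one tab between columns, and that the pattern in the question was `.*` between the backticks. The variable names are made up for the demonstration.

```shell
# One sample line from the question, with explicit tabs.
line=$(printf 'GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text`\tGO:5438534\t`text`text')

# Greedy: .* stretches from the first backtick to the last backtick that is
# followed by a tab, so the GO:5373845 column in between is swallowed.
greedy=$(printf '%s\n' "$line" | sed -E 's/\t`.*`\t//g')

# Bounded: [^\t]* cannot cross a tab, so each text column is removed
# separately and every GO column survives.
bounded=$(printf '%s\n' "$line" | sed -E 's/\t`[^\t]*//g')

printf '%s\n%s\n' "$greedy" "$bounded"
```

The second result contains all three GO terms; the first has lost the middle one, which is the symptom described in the question.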

CodePudding user response:

This awk solution would work with any version of awk:

awk -v OFS='\t' '{
   for (i=1; i<=NF; ++i)
      if ($i ~ /^GO:/)
         s = (s ? s OFS : "") $i
   print s
   s = ""
}' file

GO:1234567  GO:5373845  GO:5438534
GO:3333333  GO:5553535
GO:3333333
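As written, the loop prints an empty line for a record that has no GO field, because `s` stays empty. The question asks to keep the lines that contain only a period, so here is a hedged variant that prints such lines unchanged. The function name `keep_go` is hypothetical, and the sketch assumes the text columns contain no spaces, since awk's default field splitting treats any whitespace as a separator.

```shell
# keep_go: print only the fields starting with "GO:", tab-separated.
# Lines without any GO field (e.g. a lone ".") are passed through as-is.
keep_go() {
  awk -v OFS='\t' '{
     s = ""
     for (i = 1; i <= NF; ++i)
        if ($i ~ /^GO:/)
           s = (s != "" ? s OFS : "") $i
     print (s != "" ? s : $0)
  }' "$@"
}

# Example run on a small hypothetical input with explicit tabs.
printf 'GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text\n.\n' | keep_go
```

Called with a file name, e.g. `keep_go file`, it reads that file instead of standard input.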