I have a large tab-delimited file from which I'd like to keep only a certain string (GO:#######) that appears multiple (and variable) times in each line, as well as the lines that are otherwise blank and contain only a period. When I use sed to replace all the non-GO strings, it removes the entire middle of the line. How do I prevent this?
The sed command I'm using (I have also tried other permutations):
sed -r 's/\t`.*`\t//g' file1.txt > file2.txt
What I have
GO:1234567 `text1`moretext` GO:5373845 `diff`text` GO:5438534 `text`text
.
GO:3333333 `txt`text` GO:5553535 `misc`text
.
.
What I'd like
GO:1234567 GO:5373845 GO:5438534
.
GO:3333333 GO:5553535
.
.
What I get
GO:1234567 GO:5438534 `text`text
.
GO:3333333 GO:5553535 `misc`text
.
.
CodePudding user response:
With GNU awk:
awk 'BEGIN{FPAT="GO:[0-9]+"; OFS="\t"} {$1=$1; print}' file
Output is tab-delimited (lines without a GO term, such as the lone periods, come out as empty lines, omitted here):
GO:1234567 GO:5373845 GO:5438534
GO:3333333 GO:5553535
From man awk:
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
CodePudding user response:
sed -E 's/\t`[^\t]*//g'
\t - tab
` - a literal backtick
[^\t]* - any non-tab character 0 or more times
Alternative:
sed -E 's/\t(`[^`]*){2}`?//g'
\t - tab
( - start of group
` - a literal backtick
[^`]* - any non-backtick character 0 or more times
) - end of group
{2} - repeat the group twice
`? - an optional backtick (since the last column only has 2 instead of 3)
... and substitute with an empty string.
Output:
GO:1234567 GO:5373845 GO:5438534
.
GO:3333333 GO:5553535
.
.
Note: These examples assume that there is exactly one tab between columns. It's hard to see here.
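Because the tabs are hard to see in the listings above, here is a minimal sketch that builds the sample input with explicit tabs via printf and runs the first sed command on it. The variable names `sample` and `result` are only illustrative, and `\t` inside the sed expression is assumed to be supported (it is a GNU sed extension).

```shell
# Build the question's sample input with real tab characters (printf expands \t).
sample=$(printf 'GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text`\tGO:5438534\t`text`text\n.\nGO:3333333\t`txt`text`\tGO:5553535\t`misc`text\n.\n.\n')

# Delete every tab that is immediately followed by a backtick-led column.
# [^\t]* stops at the next tab, so neighbouring GO columns are left alone.
# Assumption: GNU sed, which understands \t in the regex.
result=$(printf '%s\n' "$sample" | sed -E 's/\t`[^\t]*//g')

printf '%s\n' "$result"
```

The period-only lines contain no tab, so the substitution leaves them untouched.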
CodePudding user response:
I would explicitly match the non-backtick characters:
s/`[^`]*`[^`]*`//
Regular expressions are greedy: `.*` matches as much as it can, from the first backtick up to the last backtick.
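To see the greediness concretely, the following sketch runs both patterns on a shortened, made-up line (`GO:1`, `a`, `b` and so on are placeholders, not real data). It assumes GNU sed for `\t`, and it adds the `g` flag to the expression above so that every column is processed.

```shell
# One shortened sample line with explicit tabs (placeholder values).
line=$(printf 'GO:1\t`a`b`\tGO:2\t`c`d`\tGO:3\t`e`f')

# Greedy: .* runs from the first tab+backtick to the LAST backtick+tab,
# swallowing the GO:2 column that sits in between.
greedy=$(printf '%s\n' "$line" | sed -E 's/\t`.*`\t//g')

# Bounded: [^`]* cannot cross a backtick, so each match stays inside one column.
bounded=$(printf '%s\n' "$line" | sed -E 's/`[^`]*`[^`]*`//g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The bounded version keeps every GO term, but in this sketch it still leaves the now-empty columns (doubled tabs) and the final two-backtick column in place, so those need separate handling.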
CodePudding user response:
This awk solution would work with any version of awk:
awk -v OFS='\t' '{
   for (i=1; i<=NF; i++)
      if ($i ~ /^GO:/)
         s = (s ? s OFS : "") $i
   print s
   s = ""
}' file
GO:1234567 GO:5373845 GO:5438534
GO:3333333 GO:5553535
GO:3333333
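The question also asks to keep the lines that contain only a period. A possible variation on the loop above, offered as a sketch rather than as the answerer's method, prints a line unchanged when it has no GO field. It assumes the fields are strictly tab-separated (hence `-F'\t'`), and the input here is a shortened version of the question's sample.

```shell
# Shortened sample input with explicit tabs.
sample=$(printf 'GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text`\n.\nGO:3333333\t`txt`text`\tGO:5553535\t`misc`text\n.\n')

result=$(printf '%s\n' "$sample" | awk -F'\t' -v OFS='\t' '{
    s = ""
    # Collect only the fields that start with GO:, joined by tabs.
    for (i = 1; i <= NF; i++)
        if ($i ~ /^GO:/)
            s = (s == "" ? $i : s OFS $i)
    # Assumption: a line without any GO field (e.g. ".") passes through as-is.
    print (s == "" ? $0 : s)
}')

printf '%s\n' "$result"
```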