Renaming variant column

I have a large file with rsIDs in the 2nd field.

Some variants are in this format: chr1-97981343:rs55886062-AT

Using bash commands, how can I replace these identifiers with just the rsID (e.g. rs55886062)?

Toy data set:

1   rs3918290   110 97915614    A   G
1   chr1-97981343:rs55886062-AT 110 97981343    A   T
1   rs72549303  110 97915622    C   A
1   rs17376848  110 97915624    G   A
1   rs59086055  110 97915746    A   G

The desired output:

1   rs3918290   110 97915614    A   G
1   rs55886062  110 97981343    A   T
1   rs72549303  110 97915622    C   A
1   rs17376848  110 97915624    G   A
1   rs59086055  110 97915746    A   G

CodePudding user response：

If the variant format is always structured with : and -, and if you don't mind tweaking the whitespace of the file, you can do:

awk 'split($2, a, ":") && a[2]{ split(a[2], b, "-"); $2 = b[1] }{$1 = $1}1' input
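If the whitespace does matter, here is a minimal sketch of the same idea that keeps the separators intact. It assumes the file is tab-delimited, which the question does not confirm, and the function name fix_rsid_tabs is just a hypothetical wrapper for illustration:

```shell
# Sketch only: assumes tab-separated input (an assumption, not stated in the question).
# fix_rsid_tabs is a hypothetical helper name; it reads the named files, or stdin if none.
fix_rsid_tabs() {
  awk 'BEGIN { FS = OFS = "\t" }          # read and write tabs, so the layout is preserved
       split($2, a, ":") > 1 {            # only fields that contain a ":" are rewritten
         split(a[2], b, "-")              # a[2] looks like "rs55886062-AT"
         $2 = b[1]                        # keep the part before the first "-"
       }
       1' "$@"                            # print every line, changed or not
}
```

Call it as fix_rsid_tabs input > output. Lines whose 2nd field has no ":" pass through unchanged.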

CodePudding user response：

Some more samples would help to construct a regexp pattern. Here's one possible solution:

$ sed -E 's/\<chr[0-9]+-[0-9]+:(rs[0-9]+)-[A-Z]+/\1/' ip.txt
1   rs3918290   110 97915614    A   G
1   rs55886062 110 97981343    A   T
1   rs72549303  110 97915622    C   A
1   rs17376848  110 97915624    G   A
1   rs59086055  110 97915746    A   G

\< start-of-word anchor
chr[0-9]+-[0-9]+: match chr followed by one or more digits, then -, then one or more digits, then :
(rs[0-9]+) capture rs followed by one or more digits
-[A-Z]+ match - followed by one or more uppercase letters
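
Note that \< is a GNU extension, so it may not be available in every sed. As a hedged alternative, the sketch below anchors on start-of-line or a whitespace character instead and puts that character back in the replacement. The function name fix_rsid_sed is hypothetical, and it assumes your sed accepts -E (GNU and BSD sed both do):

```shell
# Sketch only: fix_rsid_sed is a hypothetical helper name.
# Reads the named files, or stdin if none are given.
fix_rsid_sed() {
  # \1 = start of line or the whitespace before the identifier (kept as-is)
  # \2 = the rsID itself, e.g. rs55886062
  sed -E 's/(^|[[:space:]])chr[0-9]+-[0-9]+:(rs[0-9]+)-[A-Z]+/\1\2/' "$@"
}
```

Call it as fix_rsid_sed ip.txt. Only the first match on each line is replaced, which is enough when the identifier appears once per line as in the toy data.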