Renaming variant column

Time:10-17

I have a large file with rsIDs in the 2nd field.

Some variants are in this format: chr1-97981343:rs55886062-AT

Using bash commands, how can I replace these identifiers with just the rsID (e.g. rs55886062)?

Toy data set:

1   rs3918290   110 97915614    A   G
1   chr1-97981343:rs55886062-AT 110 97981343    A   T
1   rs72549303  110 97915622    C   A
1   rs17376848  110 97915624    G   A
1   rs59086055  110 97915746    A   G

The desired output:

1   rs3918290   110 97915614    A   G
1   rs55886062  110 97981343    A   T
1   rs72549303  110 97915622    C   A
1   rs17376848  110 97915624    G   A
1   rs59086055  110 97915746    A   G

CodePudding user response:

If the variant format is always structured with : and -, and if you don't mind tweaking the whitespace of the file, you can do:

awk 'split($2, a, ":") && a[2]{ split(a[2], b, "-"); $2 = b[1] }{$1 = $1}1' input
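If the file is tab-delimited (an assumption; the toy data above does not show which delimiter is used), setting both FS and OFS to a tab should leave the original whitespace untouched. A minimal sketch of the same split-based idea, wrapped in a hypothetical helper named fix_rsid_tsv:

```shell
# Hypothetical helper: reads rows on stdin, writes the fixed rows to stdout.
# Assumption: columns are separated by single tab characters.
fix_rsid_tsv() {
    awk 'BEGIN { FS = OFS = "\t" }
         # Only rewrite field 2 when it contains a ":" with text after it
         split($2, a, ":") && a[2] { split(a[2], b, "-"); $2 = b[1] }
         1'
}

# Usage on two rows of the toy data
printf '1\trs3918290\t110\t97915614\tA\tG\n1\tchr1-97981343:rs55886062-AT\t110\t97981343\tA\tT\n' | fix_rsid_tsv
```

Because OFS matches FS here, the `$1 = $1` step is not needed: rows that are not modified are printed exactly as they were read.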

CodePudding user response:

Some more samples would help to construct a regexp pattern. Here's one possible solution:

$ sed -E 's/\<chr[0-9]+-[0-9]+:(rs[0-9]+)-[A-Z]+/\1/' ip.txt
1   rs3918290   110 97915614    A   G
1   rs55886062  110 97981343    A   T
1   rs72549303  110 97915622    C   A
1   rs17376848  110 97915624    G   A
1   rs59086055  110 97915746    A   G
  • \< start of word anchor
  • chr[0-9]+-[0-9]+: match chr followed by one or more digits followed by - followed by one or more digits followed by :
  • (rs[0-9]+) capture rs followed by one or more digits
  • -[A-Z]+ match - followed by one or more uppercase characters
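The \< word anchor is a GNU extension and may not be available in every sed. A rough sketch of the same substitution without it, wrapped in a hypothetical helper named strip_variant_prefix; the [0-9XY] class is an assumption that identifiers such as chrX or chrY can also occur, which the sample data does not confirm:

```shell
# Hypothetical helper: reads rows on stdin, writes the fixed rows to stdout.
# Assumption: chromosome labels are digits, X or Y (chr1, chr22, chrX, ...).
strip_variant_prefix() {
    # Keep only the captured rsID; drop the chrN-POS: prefix and -ALLELES suffix
    sed -E 's/chr[0-9XY]+-[0-9]+:(rs[0-9]+)-[A-Z]+/\1/'
}

# Usage on one row of the toy data
printf '1\tchr1-97981343:rs55886062-AT\t110\t97981343\tA\tT\n' | strip_variant_prefix
```

Rows whose second field is already a plain rsID do not match the pattern and pass through unchanged.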