I have a large file with rsIDs in the 2nd field.
Some variants are in this format: chr1-97981343:rs55886062-AT
Using bash commands, how can I replace these identifiers with just the rsID (e.g. rs55886062)?
Toy data set:
1 rs3918290 110 97915614 A G
1 chr1-97981343:rs55886062-AT 110 97981343 A T
1 rs72549303 110 97915622 C A
1 rs17376848 110 97915624 G A
1 rs59086055 110 97915746 A G
The desired output:
1 rs3918290 110 97915614 A G
1 rs55886062 110 97981343 A T
1 rs72549303 110 97915622 C A
1 rs17376848 110 97915624 G A
1 rs59086055 110 97915746 A G
CodePudding user response:
If the variant format is always delimited by : and - , and if you don't mind tweaking the whitespace of the file, you can do:
awk 'split($2, a, ":") && a[2]{ split(a[2], b, "-"); $2 = b[1] }{$1 = $1}1' input
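To make the one-liner easier to follow, here is the same awk program spread over several lines with comments. This is only a sketch: the function name fix_rsids is made up for illustration, and it assumes the rsID is always the text between the first : and the following - in field 2.

```bash
# Hypothetical helper name; wraps the awk program so it can read files or stdin.
fix_rsids() {
  awk '
    # split() returns the piece count; a[2] is non-empty only if field 2 has a ":"
    split($2, a, ":") && a[2] {
      split(a[2], b, "-")   # b[1] is the part before the first "-": the rsID
      $2 = b[1]
    }
    { $1 = $1 }             # rebuild every record so spacing is uniform (single spaces)
    1                       # always-true pattern: print the record
  ' "$@"
}

# Try it on the two kinds of line from the toy data set.
printf '%s\n' \
  '1 rs3918290 110 97915614 A G' \
  '1 chr1-97981343:rs55886062-AT 110 97981343 A T' | fix_rsids
```

The $1 = $1 assignment is the "tweaking the whitespace" part: it makes awk re-join every line with single spaces, so tabs or runs of spaces in the input are not preserved.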
CodePudding user response:
Some more samples would help to construct a regexp pattern. Here's one possible solution:
$ sed -E 's/\<chr[0-9]+-[0-9]+:(rs[0-9]+)-[A-Z]+/\1/' ip.txt
1 rs3918290 110 97915614 A G
1 rs55886062 110 97981343 A T
1 rs72549303 110 97915622 C A
1 rs17376848 110 97915624 G A
1 rs59086055 110 97915746 A G
\< start of word anchor
chr[0-9]+-[0-9]+: match chr followed by one or more digits followed by - followed by one or more digits followed by :
(rs[0-9]+) capture rs followed by one or more digits
-[A-Z]+ match - followed by one or more uppercase characters
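One caveat: \< is a GNU extension and may not be available in every sed. If that is a concern, a possible variant is to anchor on the start of the line or on preceding whitespace instead. This is a sketch rather than a drop-in equivalent (it is slightly stricter than a word boundary), and the function name extract_rsids is made up for illustration.

```bash
# Hypothetical helper name; variant of the sed command above without \<.
extract_rsids() {
  # (^|[[:space:]]) requires the chr... token to start the line or follow
  # whitespace. Group 1 puts that separator back; group 2 is the rsID.
  sed -E 's/(^|[[:space:]])chr[0-9]+-[0-9]+:(rs[0-9]+)-[A-Z]+/\1\2/' "$@"
}

# Try it on the two kinds of line from the toy data set.
printf '%s\n' \
  '1 rs3918290 110 97915614 A G' \
  '1 chr1-97981343:rs55886062-AT 110 97981343 A T' | extract_rsids
```

Unlike the awk approach, sed only touches the matched text, so the original whitespace of each line is left as it was.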