EDIT: I'm working in a Linux cluster.
I have a huge file whose 1st column is an ID and whose 2nd column is a ':'-joined combination of columns from the original file, which are repeated in the 4th, 5th and 6th columns. The input file looks like this:
1 1:71:T:C 0 71 C T
1 1:71:T:A 0 71 A T
1 1:72:GTGTGTGTT:G 0 72 G GTGTGTGTT
1 1:75:T:C 0 75 C T
1 1:75:T:* 0 75 * T
1 1:76:GTGTT:G 0 76 G GTGTT
1 1:76:GTGTT:* 0 76 * GTGTT
1 1:83:C:CAT 0 83 CAT C
1 1:87:CGT:C 0 87 C CGT
1 1:87:C:CGTGTGT 0 87 CGTGTGT C
U U:19874536:G:A 0 19874536 A G
U U:19874560:G:A 0 19874560 A G
U U:19874575:C:T 0 19874575 T C
U U:19874577:T:G 0 19874577 G T
U U:19874587:CA:C 0 19874587 C CA
U U:19874587:CAA:C 0 19874587 C CAA
U U:19874602:C:T 0 19874602 T C
U U:19876478:T:C 0 19876478 C T
U U:19876534:C:A 0 19876534 A C
U U:19876568:T:C 0 19876568 C T
22 X:29:G:GT 0 29 G GT
22 X:96:T:A 0 96 A T
22 X:146:A:G 0 146 G A
22 X:167:A:T 0 167 T A
22 X:168:T:C 0 168 C T
22 X:244:C:T 0 244 T C
22 X:253:C:A 0 253 A C
22 X:254:C:A 0 254 A C
22 X:330:G:T 0 330 T G
22 X:371:GGCGTTTACGT:G 0 371 G GGCGTTTACGT
.
.
.
I'm trying to check how the ID in the 1st column matches the original ID in the 2nd column, so I just want to print the first line that matches each original ID (the part of the 2nd column before the first ':'). I hope that is clear! I saw this solution here, and I think it should be able to help me out, but I'm not familiar with awk and I don't know how to edit it so that the match refers only to the ID (before the ':') in the 2nd column.
EDIT: Expected output:
1 1:71:T:C 0 71 C T
U U:19874536:G:A 0 19874536 A G
22 X:29:G:GT 0 29 G GT
.
.
.
CodePudding user response:
A Perl solution:
perl -F'/[\s:]+/' -lane 'BEGIN { %matches = ( 22 => "X", ); } print if ( ( $F[0] eq $F[1] || $F[1] eq $matches{ $F[0] } ) && !$seen{ $F[0] }++ ); ' infile > outfile
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in the -F option.
-F'/[\s:]+/' : Split into @F on whitespace or on :, repeated 1 or more times, rather than on whitespace only.
%matches = ( 22 => "X", ); : Create hash %matches, which maps IDs in column 1 to the corresponding original IDs in column 2 when the two differ. For speed, this is placed in the BEGIN { ... } block, which is executed only once at the beginning of the script, before the code that runs for every input line.
!$seen{ $F[0] }++ : True only for the first matching line of each value in the first column; the post-increment marks that value as seen.
SEE ALSO:
perldoc perlrun : how to execute the Perl interpreter: command line switches
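Since the question mentions awk, the same logic can probably be written there as well. This is only a rough sketch, not a tested replacement for the Perl version. It assumes the same 22 => X mapping, and the names `first_match_awk`, `matches`, `parts` and `sample.txt` are made up for illustration.

```shell
# Hypothetical test file; the rows are copied from the sample input in the question.
cat > sample.txt <<'EOF'
1 1:71:T:C 0 71 C T
1 1:71:T:A 0 71 A T
U U:19874536:G:A 0 19874536 A G
U U:19874560:G:A 0 19874560 A G
22 X:29:G:GT 0 29 G GT
22 X:96:T:A 0 96 A T
EOF

# first_match_awk is a hypothetical wrapper name.
first_match_awk() {
    awk '
        BEGIN { matches["22"] = "X" }    # assumed mapping, same as the Perl hash
        {
            split($2, parts, ":")         # parts[1] is the ID before the first ":"
            id = parts[1] ""              # appending "" forces a string comparison
            # Print only the first line per column-1 ID whose IDs agree,
            # either directly or through the matches table.
            if ((id == ($1 "") || (($1 in matches) && id == matches[$1])) && !seen[$1]++)
                print
        }
    ' "$1"
}

first_match_awk sample.txt
```

The IDs are compared as strings on purpose, so that values such as "01" and "1" are not treated as equal, mirroring Perl's `eq`.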