print single line from file that matches list of patterns in the second column

Time:05-26

EDIT: I'm working in a Linux cluster.

I have a huge file whose 1st column is an ID. The 2nd column is a combination of columns from the original file, which are repeated in the 4th, 5th, and 6th columns. The input file looks like this:

1       1:71:T:C        0       71      C       T
1       1:71:T:A        0       71      A       T
1       1:72:GTGTGTGTT:G        0       72      G       GTGTGTGTT
1       1:75:T:C        0       75      C       T
1       1:75:T:*        0       75      *       T
1       1:76:GTGTT:G    0       76      G       GTGTT
1       1:76:GTGTT:*    0       76      *       GTGTT
1       1:83:C:CAT      0       83      CAT     C
1       1:87:CGT:C      0       87      C       CGT
1       1:87:C:CGTGTGT  0       87      CGTGTGT C
U       U:19874536:G:A  0       19874536        A       G
U       U:19874560:G:A  0       19874560        A       G
U       U:19874575:C:T  0       19874575        T       C
U       U:19874577:T:G  0       19874577        G       T
U       U:19874587:CA:C 0       19874587        C       CA
U       U:19874587:CAA:C        0       19874587        C       CAA
U       U:19874602:C:T  0       19874602        T       C
U       U:19876478:T:C  0       19876478        C       T
U       U:19876534:C:A  0       19876534        A       C
U       U:19876568:T:C  0       19876568        C       T
22      X:29:G:GT       0       29      G       GT
22      X:96:T:A        0       96      A       T
22      X:146:A:G       0       146     G       A
22      X:167:A:T       0       167     T       A
22      X:168:T:C       0       168     C       T
22      X:244:C:T       0       244     T       C
22      X:253:C:A       0       253     A       C
22      X:254:C:A       0       254     A       C
22      X:330:G:T       0       330     T       G
22      X:371:GGCGTTTACGT:G     0       371     G       GGCGTTTACGT
.
.
.

I'm trying to check how the 1st column (ID) matches the original ID in the 2nd column, so I just want to print the first line that matches each original ID (the part of the 2nd column before the ':'). I hope that is clear! I saw this solution here, and I think it should be able to help me out, but I'm not familiar with awk and I don't know how to edit it so that the match refers only to the ID (before the ':') in the 2nd column.

EDIT: Expected output:

 1       1:71:T:C        0       71      C       T
 U       U:19874536:G:A  0       19874536        A       G
 22      X:29:G:GT       0       29      G       GT
 .
 .
 .

CodePudding user response:

A Perl solution:

perl -F'/[\s:]+/' -lane 'BEGIN { %matches = ( 22 => "X", ); } print if ( ( $F[0] eq $F[1] || $F[1] eq $matches{ $F[0] } ) && !$seen{ $F[0] }++ );' infile > outfile

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/[\s:]+/' : Split into @F on whitespace or on :, repeated 1 or more times, rather than on whitespace only.

%matches = ( 22 => "X", ); : Creates the hash %matches, which maps an ID in column 1 to the corresponding ID in column 2 when the two differ. For speed, this is placed in the BEGIN { ... } block, which is executed only once, before the rest of the code (which runs for every input line).
!$seen{ $F[0] }++ : True only for the first occurrence of each value in the first column. The post-increment records that the value has now been seen, so later lines with the same ID are skipped.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
