I want to remove lines whose first 3 columns are duplicated more than 3 (or 4) times. The main goal is to remove lines where the genomic coordinates are duplicated more than 3 or 4 times.
Input file.tsv
chr | position | position2 | ref | alt |
---|---|---|---|---|
chr21 | 10464942 | 10464942 | T | C |
chr21 | 10464942 | 10464942 | T | C |
chr21 | 10464961 | 10464961 | A | G |
chr21 | 10464961 | 10464961 | C | G |
chr21 | 10464961 | 10464961 | A | G |
chr21 | 10464961 | 10464961 | T | C |
chr21 | 10465086 | 10465086 | T | C |
Desired output if n=3
chr | position | position2 | ref | alt |
---|---|---|---|---|
chr21 | 10464942 | 10464942 | T | C |
chr21 | 10464942 | 10464942 | T | C |
chr21 | 10465086 | 10465086 | T | C |
I tried `awk '{if(!seen[$1,$2,$3]) {if(count[$1,$2,$3]<=3) print}}'` and some `sort` and `uniq` combinations, but they don't get me the output I want.
CodePudding user response:
Annotating with a dup count lets us easily solve this.
Python will be more convenient than awk.
```python
import csv

import typer


def get_annotated_rows(sheet, prefix_length=3):
    """Generate (group_size, row) tuples.

    group_size is the total number of consecutive rows whose first
    prefix_length columns match, so the caller can drop every row of an
    over-duplicated group. We assess duplicates by examining just an
    initial prefix of each row, and assume the input is already sorted
    on those columns (as the sample data is).
    """
    group, prev = [], None
    for row in sheet:
        prefix = row[:prefix_length]
        if prefix != prev:
            for buffered in group:
                yield len(group), buffered
            group, prev = [], prefix
        group.append(row)
    for buffered in group:
        yield len(group), buffered


def main(infile: str = "input_file.tsv", n: int = 4):
    with open(infile) as fin:
        sheet = csv.reader(fin, delimiter="\t")
        for count, row in get_annotated_rows(sheet):
            if count <= n:
                print("\t".join(row))


if __name__ == "__main__":
    typer.run(main)
```
The script uses Typer for its command-line options (`--infile`, `--n`):

```
$ pip install typer
```
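If the file might not be sorted on the coordinate columns, a two-pass variant with `collections.Counter` drops the sorted-input assumption. This is a standalone sketch with the sample data inlined, not part of the script above:

```python
import csv
from collections import Counter
from io import StringIO

# Sample data from the question, inlined for a self-contained demo.
tsv = """chr\tposition\tposition2\tref\talt
chr21\t10464942\t10464942\tT\tC
chr21\t10464942\t10464942\tT\tC
chr21\t10464961\t10464961\tA\tG
chr21\t10464961\t10464961\tC\tG
chr21\t10464961\t10464961\tA\tG
chr21\t10464961\t10464961\tT\tC
chr21\t10465086\t10465086\tT\tC
"""

rows = list(csv.reader(StringIO(tsv), delimiter="\t"))
header, body = rows[0], rows[1:]

n = 3
# Pass 1: tally how often each 3-column coordinate key occurs.
counts = Counter(tuple(row[:3]) for row in body)
# Pass 2: keep only rows whose whole group is within the limit;
# the 10464961 group (4 copies) is dropped entirely.
kept = [row for row in body if counts[tuple(row[:3])] <= n]

for row in [header] + kept:
    print("\t".join(row))
```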
CodePudding user response:
A common shell-scripting trick is to reformat data so it can be processed easily by the *nix utilities. Often the troublesome utility is the `uniq` command with its `-f` (skip fields) option, which skips fields at the front of the record. Many times you wish you could skip fields at the end instead, so we rely on `awk` to reformat the data to put the skippable fields at the end:
```bash
#!/bin/bash
tail -n +2 file.tsv \
 | awk '{print $4 " " $5 " " $1 "_" $2 "_" $3}' \
 | sort -k3 | uniq -c -f2 \
 | awk '$1<=3{
     split($4, arr, "_")
     for (i=1; i<=$1; i++) {
       print arr[1] "\t" arr[2] "\t" arr[3] "\t" $2 " " $3
     }
   }'
```
Output

```
chr21	10464942	10464942	T C
chr21	10464942	10464942	T C
chr21	10465086	10465086	T C
```
You can change the field separators in the `print` statement as needed for your final output.
(And this code can be folded up onto one line, giving the much desired (if misvalued) "one-liner" (-: ). IHTH