I want to remove lines whose first 3 columns are duplicated more than 3 (or 4) times. The main goal is to remove lines where the genomic coordinates are duplicated more than 3 or 4 times.
Input file.tsv
chr | position | position2 | ref | alt |
---|---|---|---|---|
chr21 | 10464942 | 10464942 | T | C |
chr21 | 10464942 | 10464942 | T | C |
chr21 | 10464961 | 10464961 | A | G |
chr21 | 10464961 | 10464961 | C | G |
chr21 | 10464961 | 10464961 | A | G |
chr21 | 10464961 | 10464961 | T | C |
chr21 | 10465086 | 10465086 | T | C |
Desired output if n=3
chr | position | position2 | ref | alt |
---|---|---|---|---|
chr21 | 10464942 | 10464942 | T | C |
chr21 | 10464942 | 10464942 | T | C |
chr21 | 10465086 | 10465086 | T | C |
I tried `awk '{if(!seen[$1,$2,$3]) {if(count[$1,$2,$3]<=3) print}}'` and some `sort` and `uniq` combinations, but they don't get me the output I want.
CodePudding user response:
Annotating with a dup count lets us easily solve this.
Python will be more convenient than awk.
```python
import csv

import typer


def get_annotated_rows(sheet, prefix_length=3):
    """Generate (group_size, row) tuples.

    group_size is the total number of consecutive rows whose first
    prefix_length columns match, so the caller can drop every row of an
    over-duplicated group. We assess duplicates by examining just an
    initial prefix of each row, and assume the input is already sorted
    on those columns (as the sample data is).
    """
    group, prev = [], None
    for row in sheet:
        prefix = row[:prefix_length]
        if prefix != prev:
            for buffered in group:
                yield len(group), buffered
            group, prev = [], prefix
        group.append(row)
    for buffered in group:
        yield len(group), buffered


def main(infile: str = "input_file.tsv", n: int = 4):
    with open(infile) as fin:
        sheet = csv.reader(fin, delimiter="\t")
        for count, row in get_annotated_rows(sheet):
            if count <= n:
                print("\t".join(row))


if __name__ == "__main__":
    typer.run(main)
```
The script uses Typer for its command-line options (`--infile`, `--n`):

```
$ pip install typer
```
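If the file might not be sorted on the coordinate columns, a two-pass variant with `collections.Counter` drops the sorted-input assumption. This is a standalone sketch with the sample data inlined, not part of the script above:

```python
import csv
from collections import Counter
from io import StringIO

# Sample data from the question, inlined for a self-contained demo.
tsv = """chr\tposition\tposition2\tref\talt
chr21\t10464942\t10464942\tT\tC
chr21\t10464942\t10464942\tT\tC
chr21\t10464961\t10464961\tA\tG
chr21\t10464961\t10464961\tC\tG
chr21\t10464961\t10464961\tA\tG
chr21\t10464961\t10464961\tT\tC
chr21\t10465086\t10465086\tT\tC
"""

rows = list(csv.reader(StringIO(tsv), delimiter="\t"))
header, body = rows[0], rows[1:]

n = 3
# Pass 1: tally how often each 3-column coordinate key occurs.
counts = Counter(tuple(row[:3]) for row in body)
# Pass 2: keep only rows whose whole group is within the limit;
# the 10464961 group (4 copies) is dropped entirely.
kept = [row for row in body if counts[tuple(row[:3])] <= n]

for row in [header] + kept:
    print("\t".join(row))
```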
CodePudding user response:
A common shell-scripting trick is to reformat data so it can be processed easily by the *nix utilities. Often the troublesome utility is the `uniq` command with its `-f` (skip fields) option, which skips fields at the front of the record. Many times you wish you could skip fields at the end instead, so we rely on `awk` to reformat the data to put the skippable fields at the end:
```bash
#!/bin/bash
tail -n +2 file.tsv \
 | awk '{print $4 " " $5 " " $1 "_" $2 "_" $3}' \
 | sort -k3 | uniq -c -f2 \
 | awk '$1<=3{
     split($4, arr, "_")
     for (i=1; i<=$1; i++) {
       print arr[1] "\t" arr[2] "\t" arr[3] "\t" $2 " " $3
     }
   }'
```
Output

```
chr21	10464942	10464942	T C
chr21	10464942	10464942	T C
chr21	10465086	10465086	T C
```
You can change the field separators in the `print` statement as needed for your final output.
(And this code can be folded up onto one line, giving the much desired (if misvalued) "one-liner" (-: ). IHTH