Count pattern matches in paragraphs separated by empty lines in Python

I want to count the rows that contain one of the patterns TRP, PHE, or MET, and I need the count per paragraph (paragraphs are separated by empty lines). Then I would like to calculate the percentage of matches by dividing the match count by the number of lines in each paragraph. Is there a quick Python solution for this?

My input looks like:

THR 61  65.21
LEU 62  63.85
PRO 63  54.61
LEU 64  50.74
ALA 65  57.40
PRO 66  56.49
ASP 67  56.77
PRO 68  55.94
TYR 69  56.06
PRO 70  56.55
GLY 71  57.74
HIS 72  55.69
ASN 73  64.70
PRO 74  65.70
        
ASP 422 65.05
SER 423 53.19
SER 424 45.39
ARG 425 47.80
ALA 426 48.84
ARG 427 46.19
ALA 428 46.81
SER 429 51.64
GLY 430 56.53
GLY 431 69.14
        
ASP 471 59.01
VAL 472 51.82
ASP 473 52.63
GLN 474 45.86
LEU 475 44.30
SER 476 45.83
LEU 477 45.78
THR 478 37.91
PRO 479 44.77
VAL 480 41.47
VAL 481 46.86
PRO 482 46.12
GLY 483 46.38
PRO 484 49.42
PRO 485 57.74

I tried with awk but it is too hard...

CodePudding user response：

This should do the trick, assuming your input is a text file (input.txt here). If your input comes from somewhere else, load it accordingly.

import re

def calc_percentage(log, line_count):
    # Percentage of the paragraph's lines that contain each pattern
    percentage = {}
    for pattern, hits in log.items():
        percentage[pattern] = (hits / line_count) * 100
    return percentage

def report(para, log, line_count):
    print(f"end of para {para} & number of lines {line_count}")
    print(f"count from paragraph {para} is {log}")
    print(f"percentages are as follows {calc_percentage(log, line_count)}")

# Example patterns; use ['TRP', 'PHE', 'MET'] for the ones in the question.
# For a handful of patterns a literal also works: log = {'TRP': 0, 'THR': 0, 'PRO': 0}
patterns = ['TRP', 'THR', 'PRO']
log = dict.fromkeys(patterns, 0)
para = 1
count = 0
with open("input.txt") as input_file:
    for line in input_file:
        if re.match(r'\s*$', line):
            # A blank (or whitespace-only) line ends the current paragraph
            if count:
                report(para, log, count)
                para += 1
            # Reset for the next paragraph
            count = 0
            log = dict.fromkeys(patterns, 0)
            continue
        count += 1
        for pattern in patterns:
            if pattern in line:
                log[pattern] += 1

# Handle the last paragraph (the file may not end with a blank line)
if count:
    report(para, log, count)
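
For comparison, here is a more compact sketch of the same idea. It assumes the whole input fits in memory and that paragraphs are separated by blank or whitespace-only lines. The function name `paragraph_stats` and the sample text are made up for illustration; the patterns are the three from the question, and a line counts once if it contains any of them.

```python
import re

PATTERNS = ("TRP", "PHE", "MET")  # patterns from the question


def paragraph_stats(text, patterns=PATTERNS):
    """Return (matching_lines, total_lines, percentage) for each paragraph."""
    stats = []
    # Split on one or more blank (or whitespace-only) lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if not lines:
            continue  # empty input: nothing to report
        # A line counts once if it contains any of the patterns.
        hits = sum(1 for line in lines if any(p in line for p in patterns))
        stats.append((hits, len(lines), 100 * hits / len(lines)))
    return stats


# Illustrative data, not the question's input.
sample = "TRP 1 50.0\nALA 2 40.0\n\nPHE 3 30.0\nMET 4 20.0\nGLY 5 10.0\nLEU 6 5.0\n"
for number, (hits, total, pct) in enumerate(paragraph_stats(sample), start=1):
    print(f"paragraph {number}: {hits}/{total} lines match ({pct:.1f}%)")
```

To run it on a file, read the contents first, for example `paragraph_stats(open("input.txt").read())`.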

CodePudding user response：

This task is straightforward in awk if the record separator is set to read paragraphs (blocks of lines separated by one or more blank lines) using RS="" (the special meaning is explained towards the bottom of this page of the awk manual: https://www.gnu.org/software/gawk/manual/html_node/awk-split-records.html) and the field separator is set to read lines as fields using FS="\n". In my example I set these in a BEGIN block, but command-line options could be used as well.

Once the fields are configured, pattern blocks are established for each search pattern. The action of each is to increment a counter (action only applied when pattern is present). A final universal block can print the count and the number of fields/lines (NF) for that record, and perform whatever arithmetic is required with them.

awk procedure run on file.txt:

awk 'BEGIN {RS=""; FS="\n"} /TRP/{aa++} /PHE/{aa++} /MET/{aa++} {print "set " NR ": " 0+aa " matches in " NF " lines, Ratio=" (0+aa)/NF; aa=0}' file.txt

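Note that the patterns are separated into distinct blocks so that the counter is incremented once for each pattern found in the paragraph; if a combined or (|) pattern had been used, the count would only increase once even if two patterns were present. Because each pattern is tested against the whole record, it adds at most one to the count per paragraph, however many lines contain it.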
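
A small Python sketch of that distinction, using a made-up paragraph rather than the question's data: testing each pattern once against the whole paragraph counts the distinct patterns present, while testing line by line counts the matching lines, which is the number the question's percentage needs.

```python
import re

PATTERNS = ("TRP", "PHE", "MET")

# Hypothetical paragraph: two TRP lines, one PHE line, one non-matching line.
paragraph = "TRP 10 50.0\nTRP 11 48.0\nPHE 12 47.0\nALA 13 45.0"

# One test per pattern against the whole paragraph: each pattern adds at most 1.
patterns_present = sum(1 for p in PATTERNS if re.search(p, paragraph))

# One test per line: every matching line adds 1.
matching_lines = sum(
    1 for line in paragraph.splitlines() if re.search("|".join(PATTERNS), line)
)

print(patterns_present)  # 2 (TRP and PHE occur; MET does not)
print(matching_lines)    # 3 (two TRP lines and one PHE line)
```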

output

set 1: 0 matches in 14 lines, Ratio=0
set 2: 0 matches in 10 lines, Ratio=0
set 3: 0 matches in 15 lines, Ratio=0

If totals for the file are required, a second counter variable can be added to each block that is not reset in the last block, along with a counter to accumulate the NF count for each record. In such a case, an END block can be used to sum and calculate overall ratios.
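
The same accumulation can be sketched in Python, assuming paragraphs are separated by blank or whitespace-only lines and that a line counts once if it contains any pattern. The name `file_totals` and the sample text are illustrative only.

```python
import re

PATTERNS = ("TRP", "PHE", "MET")


def file_totals(text, patterns=PATTERNS):
    """Return (total_matching_lines, total_lines) summed over all paragraphs."""
    total_hits = 0
    total_lines = 0
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # These two running totals are never reset between paragraphs.
        total_hits += sum(1 for line in lines if any(p in line for p in patterns))
        total_lines += len(lines)
    return total_hits, total_lines


# Illustrative data, not the question's input.
sample = "TRP 1 50.0\nALA 2 40.0\n\nMET 3 30.0\nGLY 4 20.0\nLEU 5 10.0\n"
hits, lines = file_totals(sample)
ratio = hits / lines if lines else 0.0  # guard against an empty file
print(f"{hits} matches in {lines} lines, overall ratio={ratio:.2f}")
```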