I want to count the rows that contain one of the patterns TRP, PHE or MET, and I need the count per paragraph (paragraphs are separated by empty lines). Then I would like to calculate the percentage of matches by dividing the match count by the number of lines in each paragraph. Is there a quick Python solution for this?
My input looks like:
THR 61 65.21
LEU 62 63.85
PRO 63 54.61
LEU 64 50.74
ALA 65 57.40
PRO 66 56.49
ASP 67 56.77
PRO 68 55.94
TYR 69 56.06
PRO 70 56.55
GLY 71 57.74
HIS 72 55.69
ASN 73 64.70
PRO 74 65.70

ASP 422 65.05
SER 423 53.19
SER 424 45.39
ARG 425 47.80
ALA 426 48.84
ARG 427 46.19
ALA 428 46.81
SER 429 51.64
GLY 430 56.53
GLY 431 69.14

ASP 471 59.01
VAL 472 51.82
ASP 473 52.63
GLN 474 45.86
LEU 475 44.30
SER 476 45.83
LEU 477 45.78
THR 478 37.91
PRO 479 44.77
VAL 480 41.47
VAL 481 46.86
PRO 482 46.12
GLY 483 46.38
PRO 484 49.42
PRO 485 57.74
I tried with awk but it is too hard...
CodePudding user response:
This should do the trick, assuming your input is a text file. If your input is not a text file, you can load it accordingly.
import re

PATTERNS = ['TRP', 'THR', 'PRO']

def calc_percentage(log, line_count):
    # Percentage of the paragraph's lines that contain each pattern
    percentage = {}
    for pattern, total in log.items():
        percentage[pattern] = (total / line_count) * 100
    return percentage

#log = {'TRP': 0, 'THR': 0, 'PRO': 0} can be written out by hand if there are only a few patterns
log = {pattern: 0 for pattern in PATTERNS}
para = 1
count = 0
with open("input.txt") as input_file:
    for line in input_file:
        count += 1
        for pattern in log:
            if pattern in line:
                log[pattern] += 1
        if re.match(r'\r?\n', line):
            # A blank line ends the paragraph; it was counted above, so subtract it
            line_count = count - 1
            if line_count > 0:  # skip repeated blank lines
                print(f"end of para {para} & number of lines {line_count}")
                print(f"count from paragraph {para} is {log}")
                percentage = calc_percentage(log, line_count)
                print(f"percentages are as follows {percentage}")
                para = para + 1
            # reset for next paragraph
            count = 0
            log = {pattern: 0 for pattern in PATTERNS}
# handling the last paragraph (the file may not end with a blank line)
line_count = count
if line_count > 0:
    print(f"end of para {para} & number of lines {line_count}")
    print(f"count from paragraph {para} is {log}")
    percentage = calc_percentage(log, line_count)
    print(f"percentages are as follows {percentage}")
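A more compact variant splits the whole text on blank lines first and then counts per paragraph. This is only a sketch: the helper name `paragraph_stats`, the sample text and the default patterns (taken from the question) are assumptions for illustration, and a line is counted once if it contains any of the patterns as a substring.

```python
import re

# Hypothetical helper for illustration, not part of the answer above.
def paragraph_stats(text, patterns=("TRP", "PHE", "MET")):
    """Return (matches, lines, percentage) for each blank-line-separated paragraph."""
    stats = []
    # One or more blank (or whitespace-only) lines separate paragraphs.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue  # empty input yields no paragraphs
        # A line counts once if it contains any of the patterns.
        matches = sum(1 for ln in lines if any(p in ln for p in patterns))
        stats.append((matches, len(lines), 100 * matches / len(lines)))
    return stats

# Made-up sample with two paragraphs.
sample = """TRP 1 10.0
LEU 2 20.0

PHE 3 30.0
MET 4 40.0
ALA 5 50.0
GLY 6 60.0
"""
for number, (matches, lines, pct) in enumerate(paragraph_stats(sample), start=1):
    print(f"paragraph {number}: {matches} matches in {lines} lines ({pct:.1f}%)")
```

To run it on a file, read the file contents into a string and pass that string to `paragraph_stats`.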
CodePudding user response:
This task is straightforward in awk if the record separator is set to read paragraphs (one or more blank lines between lines) using RS="" (special meaning explained towards the bottom of this page of the awk manual: https://www.gnu.org/software/gawk/manual/html_node/awk-split-records.html), and the field separator is set to read lines as fields using FS="\n". In my example I have set these in a BEGIN block, but command-line switches could be used as well.
Once the fields are configured, pattern blocks are established for each search pattern. The action of each is to increment a counter (the action is only applied when the pattern is present). A final universal block can print the count and the number of fields/lines (NF) for that record, and perform whatever arithmetic is required with them.
awk procedure run on file.txt:
awk 'BEGIN {RS="";FS="\n";} /TRP/{aa++;} /PHE/{aa++} /MET/{aa++} {print "set " NR": " 0+aa " matches in " NF " lines, Ratio=" (0+aa)/NF; aa=0}' file.txt
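Note that the patterns are separated into distinct blocks to make sure the counter is incremented more than once when more than one pattern matches. If a combined or (|) pattern had been used, the count would only increase once even if two patterns were present. Each block is still tested only once per record, so a pattern that occurs on several lines of the same paragraph adds 1 to the counter, not one per line; counting every matching line would require looping over the fields.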
output
set 1: 0 matches in 14 lines, Ratio=0
set 2: 0 matches in 10 lines, Ratio=0
set 3: 0 matches in 15 lines, Ratio=0
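The effect of separate pattern tests versus a combined alternation can be seen in a small Python sketch. The record below is made up for illustration, and `re.search` against the whole record stands in for awk testing a pattern once per record; the last count tests each line separately for comparison.

```python
import re

# Made-up record: three lines, two of which contain TRP.
record = "TRP 1 10.0\nPHE 2 20.0\nTRP 3 30.0"
patterns = ["TRP", "PHE", "MET"]
combined = "|".join(patterns)  # "TRP|PHE|MET"

# One test per pattern against the whole record: each pattern adds at most 1.
separate_tests = sum(1 for p in patterns if re.search(p, record))

# One combined alternation against the whole record: adds at most 1 in total.
combined_test = 1 if re.search(combined, record) else 0

# One test per line: counts every line that contains any of the patterns.
per_line = sum(1 for line in record.splitlines() if re.search(combined, line))

print(separate_tests, combined_test, per_line)  # prints: 2 1 3
```

For the sample input in the question all three counts are 0 in every paragraph, because TRP, PHE and MET do not occur in it.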
If totals for the file are required, a second counter variable that is not reset in the last block can be added to each pattern block, along with a counter to accumulate the NF count for each record. In such a case, an END block can be used to sum and calculate overall ratios.
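The accumulation described here can be sketched in Python as follows. The function name `overall_ratio` is hypothetical; it takes one (matches, lines) pair per paragraph, and the pairs in the example are the counts from the output shown above.

```python
# Hypothetical helper: accumulate per-paragraph counts into file totals,
# mirroring the two running counters and the END block described above.
def overall_ratio(per_paragraph):
    total_matches = 0
    total_lines = 0
    for matches, lines in per_paragraph:
        total_matches += matches
        total_lines += lines
    # Guard against an empty file so the division cannot fail.
    ratio = total_matches / total_lines if total_lines else 0.0
    return total_matches, total_lines, ratio

# (matches, lines) per paragraph, taken from the awk output above.
counts = [(0, 14), (0, 10), (0, 15)]
total_matches, total_lines, ratio = overall_ratio(counts)
print(f"total: {total_matches} matches in {total_lines} lines, Ratio={ratio}")
```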