I have the following text file (only a fragment is shown):
chr1_964906_A/G chr1:964906 G ENSG00000187961 ENST00000622660 Transcript intron_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_964939_G/A chr1:964939 A ENSG00000187961 ENST00000338591 Transcript intron_variant - - - - - - IMPACT=MODIFIER;STRAND=1
chr1_964939_G/A chr1:964939 A ENSG00000187583 ENST00000379407 Transcript upstream_gene_variant - - - - - - IMPACT=MODIFIER;DISTANCE=1563;STRAND=1
chr1_964939_G/A chr1:964939 A ENSG00000187583 ENST00000379409 Transcript upstream_gene_variant - - - - - -
The file contains many different ENSG identifiers, such as ENSG00000187583. Each ENSG identifier has 11 digits.
I have to count how many intron_variant and upstream_gene_variant entries each gene (ENSGxxx) has, and output the result to a CSV file.
I use a dictionary for this purpose. The logic should be: if these 11 digits are not in the dictionary, they should be added with value 1. If they are already in the dictionary, the value should be changed from x to x + 1. I currently have this code, but I am not really a Python programmer and I am not sure about the correct syntax.
with open(file, 'rt') as f:
    data = f.readlines()
    Count = 0
    d = {}
    for line in data:
        if line[0] == "#":
            output.write(line)
        if line.__contains__('ENSG'):
            d[line.split('ENSG')[1][0:11]]=1
            if 1 in d:
                d=1
            else:
                Count = 1
Any suggestions?
Thank you!
CodePudding user response:
Can you try this:
from collections import Counter

with open('data.txt') as fp:
    ensg = []
    for line in fp:
        idx = line.find('ENSG')
        # Skip header lines and lines without an ENSG identifier
        if not line.startswith('#') and idx != -1:
            # Keep only the 11 digits that follow 'ENSG'
            ensg.append(line[idx + 4:idx + 15])

count = Counter(ensg)
>>> count
Counter({'00000187961': 2, '00000187583': 2})
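If you would rather keep the plain-dictionary logic described in the question (add the ID with value 1 if it is missing, otherwise increment it), a minimal sketch could look like the following. The function name count_ensg and the shortened sample lines are only illustrative assumptions, not part of the original code.

```python
def count_ensg(lines):
    """Count occurrences of each 11-digit ENSG ID in an iterable of lines."""
    counts = {}
    for line in lines:
        idx = line.find('ENSG')
        # Skip header lines and lines without an ENSG identifier.
        if line.startswith('#') or idx == -1:
            continue
        gene = line[idx + 4:idx + 15]
        # dict.get returns 0 for an unseen key, so this handles both cases.
        counts[gene] = counts.get(gene, 0) + 1
    return counts


# Hypothetical, shortened sample lines in the same layout as the question.
sample = [
    "# header line",
    "chr1_964906_A/G chr1:964906 G ENSG00000187961 ENST00000622660 Transcript intron_variant",
    "chr1_964939_G/A chr1:964939 A ENSG00000187583 ENST00000379407 Transcript upstream_gene_variant",
    "chr1_964939_G/A chr1:964939 A ENSG00000187583 ENST00000379409 Transcript upstream_gene_variant",
]
print(count_ensg(sample))
```

Since open() returns an iterable of lines, the file object can be passed in directly in place of the sample list.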
CodePudding user response:
Here's another interpretation of your requirement:-
I have modified your sample data so that the first ENSG value is ENSG00000187971, to highlight how this works.
D = {}
with open('eng.txt') as eng:
    for line in eng:
        if not line.startswith('#'):
            t = line.split()
            V = t[6]  # variant type, e.g. intron_variant
            E = t[3]  # gene ID, e.g. ENSG00000187961
            if V not in D:
                D[V] = {}
            if E not in D[V]:
                D[V][E] = 1
            else:
                D[V][E] += 1
print(D)
The output of this is:-
{'intron_variant': {'ENSG00000187971': 1, 'ENSG00000187961': 1}, 'upstream_gene_variant': {'ENSG00000187583': 2}}
So what you have now is a dictionary keyed by variant. Each variant has its own dictionary, keyed by the ENSG values, holding a count of occurrences of each ENSG value.
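The question also asks for CSV output, which the code above does not produce. Here is a minimal sketch using the standard csv module, assuming D has the nested shape shown in the output above. The function name write_counts, the column names, and the file name counts.csv are placeholders of my own.

```python
import csv
import io


def write_counts(counts, out):
    """Write one row per (gene, variant) pair from a {variant: {gene: n}} dict."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['gene', 'variant', 'count'])  # assumed column names
    for variant, genes in counts.items():
        for gene, n in genes.items():
            writer.writerow([gene, variant, n])


D = {
    'intron_variant': {'ENSG00000187971': 1, 'ENSG00000187961': 1},
    'upstream_gene_variant': {'ENSG00000187583': 2},
}

# An in-memory buffer is used here so the result can be printed; for a real
# file, pass open('counts.csv', 'w', newline='') instead.
buffer = io.StringIO()
write_counts(D, buffer)
print(buffer.getvalue())
```

Writing one row per gene and variant pair keeps the file easy to filter or pivot later in a spreadsheet.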