Counting items in txt file with Python dictionaries

I have the following txt file (only a fragment is shown):

chr1_964906_A/G chr1:964906     G       ENSG00000187961 ENST00000622660 Transcript      intron_variant  -       -       -       -       -       -       IMPACT=MODIFIER;STRAND=1
chr1_964939_G/A chr1:964939     A       ENSG00000187961 ENST00000338591 Transcript      intron_variant  -       -       -       -       -       -       IMPACT=MODIFIER;STRAND=1
chr1_964939_G/A chr1:964939     A       ENSG00000187583 ENST00000379407 Transcript      upstream_gene_variant   -       -       -       -       -       -       IMPACT=MODIFIER;DISTANCE=1563;STRAND=1
chr1_964939_G/A chr1:964939     A       ENSG00000187583 ENST00000379409 Transcript      upstream_gene_variant   -       -       -       -       -       -

The file contains many different ENSG IDs that are not known in advance, such as ENSG00000187583. Each ENSG ID has 11 digits.

I have to count how many intron_variant and upstream_gene_variant entries each gene (ENSGxxx) contains, and output the result to a CSV file.

I am using a dictionary for this purpose. The logic should be: if these 11 digits are not yet in the dictionary, they should be added with value 1. If they are already in the dictionary, the value should be incremented by 1. I currently have this code, but I am not really a Python programmer and am not sure about the correct syntax.

    with open(file, 'rt') as f:
        data = f.readlines()
        Count = 0
        d = {}
        for line in data:
            if line[0] == "#":
                output.write(line)
            if line.__contains__('ENSG'): 
                d[line.split('ENSG')[1][0:11]]=1
                if 1 in d:
                    d=1
                else:
                    Count += 1

Any suggestions?

Thank you!

CodePudding user response：

Can you try this:

from collections import Counter

with open('data.txt') as fp:
    ensg = []
    for line in fp:
        idx = line.find('ENSG')
        if not line.startswith('#') and idx != -1:
            ensg.append(line[idx + 4:idx + 15])
count = Counter(ensg)

>>> count
Counter({'00000187961': 2, '00000187583': 2})
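
If the counts also need to be split by consequence type, the same Counter idea can be keyed on a (gene, consequence) pair instead of the bare ID. Here is a minimal sketch, assuming whitespace-separated columns with the gene ID in the fourth field and the consequence in the seventh, as in the sample above. The function name count_gene_variants and the inline sample lines are only for illustration; in practice you would pass the open file object instead of the list.

```python
from collections import Counter

def count_gene_variants(lines):
    """Count (gene, consequence) pairs, skipping '#' header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith('#'):
            continue
        fields = line.split()
        # Assumed layout: fields[3] is the ENSG ID, fields[6] the consequence.
        if len(fields) > 6 and fields[3].startswith('ENSG'):
            counts[(fields[3], fields[6])] += 1
    return counts

# Shortened copies of the sample rows from the question.
sample = [
    "chr1_964906_A/G chr1:964906 G ENSG00000187961 ENST00000622660 Transcript intron_variant -",
    "chr1_964939_G/A chr1:964939 A ENSG00000187961 ENST00000338591 Transcript intron_variant -",
    "chr1_964939_G/A chr1:964939 A ENSG00000187583 ENST00000379407 Transcript upstream_gene_variant -",
    "chr1_964939_G/A chr1:964939 A ENSG00000187583 ENST00000379409 Transcript upstream_gene_variant -",
]

counts = count_gene_variants(sample)
for (gene, consequence), n in counts.items():
    print(gene, consequence, n)
```

A missing key in a Counter is treated as 0, so no "is it already in the dictionary?" check is needed before incrementing.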

CodePudding user response：

Here's another interpretation of your requirement:-

I have modified your sample data so that the first ENSG value is ENSG00000187971, to highlight how this works.

D = {}

with open('eng.txt') as eng:
    for line in eng:
        if not line.startswith('#'):
            t = line.split()
            V = t[6]
            E = t[3]
            if V not in D:
                D[V] = {}
            if E not in D[V]:
                D[V][E] = 1
            else:
                D[V][E] += 1
print(D)

The output of this is:-

{'intron_variant': {'ENSG00000187971': 1, 'ENSG00000187961': 1}, 'upstream_gene_variant': {'ENSG00000187583': 2}}

So what you have now is a dictionary keyed by variant type. Each variant type maps to its own dictionary, keyed by ENSG value, holding the number of occurrences of that ENSG value for that variant.
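
Since the question also asks for CSV output, a nested dictionary of this shape can be flattened into one row per (variant, gene) pair with the standard csv module. This is only a rough sketch: the function name write_counts and the column headers are placeholders I made up, and D is hard-coded here to the output shown above so the snippet runs on its own.

```python
import csv
import io

def write_counts(counts, out):
    """Write one CSV row per (variant, gene) pair to an open text file."""
    writer = csv.writer(out)
    writer.writerow(['variant', 'gene', 'count'])  # header names are arbitrary
    for variant, genes in counts.items():
        for gene, n in genes.items():
            writer.writerow([variant, gene, n])

# Same structure as the dictionary printed above.
D = {
    'intron_variant': {'ENSG00000187971': 1, 'ENSG00000187961': 1},
    'upstream_gene_variant': {'ENSG00000187583': 2},
}

# An in-memory buffer is used here just to show the result.
buf = io.StringIO()
write_counts(D, buf)
print(buf.getvalue())
```

To write a real file, pass an object opened with open('counts.csv', 'w', newline='') instead of the StringIO buffer (the file name is just an example). The newline='' argument is what the csv documentation recommends so that no extra blank lines appear on Windows.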