Change format and combine info from 2 columns-CodePudding

First time posting here so please be patient!

I have a file that looks like that:

POS {ALLELE:COUNT}  
1   G:27    A:11
2   C:40    T:0
3   C:40    A:0
4   T:40    G:0
5   G:0 C:40
6   C:40    T:0
7   G:24    A:14
8   G:40    A:0
9   A:40    G:0
...

I want to combine the information from the second and third column for each line in the following format: "number[A],number[C],number[G],number[T]" so that the example above would look like that:

POS {ALLELE:COUNT}
1   11,0,27,0
2   0,40,0,0
3   0,40,0,0
4   0,0,0,40
5   0,40,0,0
6   0,40,0,0
7   14,0,24,0
8   0,0,40,0
9   40,0,0,0
...

Any idea on how I could do that would be much appreciated!

CodePudding user response：

Here's a method that works:

lines = open('test.txt','r').read().splitlines()

place = {'A':0,'C':1,'G':2,'T':3}
counts = [[0 for _ in range(4)] for _ in range(len(lines[1:]))]
for i,row in enumerate(lines[1:]):
    for ct in row.split()[1:]:
        a,b = ct.split(':')
        counts[i][place[a]] = int(b)

out_str = '\n'.join([lines[0]]   ['{:<4}{},{},{},{}'.format(i 1,*ct) 
    for i,ct in enumerate(counts)])

with open('output.txt','w') as f:
    f.write(out_str)

The resulting file reads

POS {ALLELE:COUNT}  
1   11,0,27,0
2   0,40,0,0
3   0,40,0,0
4   0,0,0,40
5   0,40,0,0
6   0,40,0,0
7   14,0,24,0
8   0,0,40,0
9   40,0,0,0

CodePudding user response：

I assume that you file is regular a text file (not a csv or delimited file) and that G:27 A:11 is a line of this text file. for each line you can do as follows (we will take the first line as an example): remove useless spaces using strip G:27 A:11.strip() gives G:27 A:11, then split on blackspaces to obtain ['G:27','A:11']. Then for each element of this list split on : to get the allele type and its count. Alltogether it would look like

resulting_table=[]
for line in file: #You can easily find how to read a file line by line
    split_line=line.strip().split(' ')
    A,T,G,C=0,0,0,0
    for pair in split_line:
        element=pair.split(':')
        if element[0]=='A':
             A=element[1]
        elif element[0]=='T':
             ...
    resulting_table.append([A,T,G,C])

And here you go ! You can then transform it easily into a dataframe or a numpy array

This is absolutely not the most efficient nor elegant way to get your desired output, but it is clear and understandable for a python beginner

CodePudding user response：

sourse.txt

POS {ALLELE:COUNT}
1   G:27    A:11
2   C:40    T:0
3   C:40    A:0
4   T:40    G:0
5   G:0 C:40
6   C:40    T:0
7   G:24    A:14
8   G:40    A:0
9   A:40    G:0

import re

template = ('A', 'C', 'G', 'T')


def proc_line(line: str):
    
    index, *elements = re.findall(r'\S ', line)
    
    data = dict([*map(lambda x: x.split(':'), elements)])
    
    return f'{index}\t'   ','.join([data.get(item, '0') for item in template])   '\n'


with open('source.txt', encoding='utf-8') as file:
    header, *lines = file.readlines()

with open('output.txt', 'w', encoding='utf-8') as new_file:
    new_file.writelines(
        [header]   list(map(proc_line, lines))
    )

output.txt

POS {ALLELE:COUNT}  
1   11,0,27,0
2   0,40,0,0
3   0,40,0,0
4   0,0,0,40
5   0,40,0,0
6   0,40,0,0
7   14,0,24,0
8   0,0,40,0
9   40,0,0,0

CodePudding user response：

$ awk -F'[ :] ' '
    NR>1 {
        delete f
        f[$2] = $3
        f[$4] = $5
        $0 = sprintf("%s   %d,%d,%d,%d", $1, f["A"], f["C"], f["G"], f["T"])
   }
1' file
POS {ALLELE:COUNT}
1   11,0,27,0
2   0,40,0,0
3   0,40,0,0
4   0,0,0,40
5   0,40,0,0
6   0,40,0,0
7   14,0,24,0
8   0,0,40,0
9   40,0,0,0