Write values of a dictionary one in each line with DictWriter

Time:07-23

So I have a Python dictionary with protein sequences and their ids. I wanted to convert that dictionary to a CSV file to upload it as a dataset to fine-tune a transformer. However, when I create the CSV it appears with the dictionary shape (key-value, key-value...).

What I want is for the CSV to have a key on one line and its value on the next line, repeating in that pattern. Is there a way to add a \n or something similar so that each key and its value end up on separate lines?

Shape of the dictionary:

{'NavAb:/1126': 'TNIVESSFFTKFIIYLIVLNGITMGLETSKTFMQSFGVYTTLFNQIVITIFTIEIILRIYVHRISFFKDPWSLFDFFVVAISLVPTSSGFEILRVLRVLRLFRLVTAVPQMRKI', 'Shaker:/1656': 'SSQAARVVAIISVFVILLSIVIFCLETLEDEVPDITDPFFLIETLCIIWFTFELTVRFLACPLNFCRDVMNVIDIIAIIPYFITTLNLLRVIRLVRVFRIFKLSRHSKGLQIL', .....

What I want in the CSV:

protein id
protein sequence
protein id
protein sequence
.....

The code I have for the moment:

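import csv

def parse_file(input_file):
    # Build a {sequence id: sequence} dictionary from FASTA-formatted lines.
    parsed_seqs = {}
    curr_seq_id = None
    curr_seq = []
    for line in input_file:  # iterate over the argument, not the global newfile
        line = line.strip()
        line = line.replace('-', '')  # drop gap characters
        if line.startswith(">"):
            # A header line closes the previous record and starts a new one.
            if curr_seq_id is not None:
                parsed_seqs[curr_seq_id] = ''.join(curr_seq)
            curr_seq_id = line[1:]
            curr_seq = []
            continue

        curr_seq.append(line)
    # Store the last record, if the file contained any.
    if curr_seq_id is not None:
        parsed_seqs[curr_seq_id] = ''.join(curr_seq)
    return parsed_seqs

with open("/content/drive/MyDrive/Colab Notebooks/seqs.fasta") as newfile:
    parsed_seqs = parse_file(newfile)

# This writes every id as a column header and every sequence in a single row.
with open('sequences.csv', 'w', newline='') as f:
    w = csv.DictWriter(f, parsed_seqs.keys())
    w.writeheader()
    w.writerow(parsed_seqs)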

The shape I want: [image not available]

New shape: [image not available]

CodePudding user response:

To get CSV output with 2 columns, one for Protein ID and one for Protein Sequence, you can do this.

import csv

parsed_seqs = {
  'NavAb:/1126': 'TNIVESS',
  'Shaker:/1656': 'SSQAARVV'
}

column_names = ["Protein ID", "Protein Sequence"]

with open('sequences.csv', 'w', newline='') as f:
    # csv.writer takes no column list; its second positional argument is the dialect.
    w = csv.writer(f)
    w.writerow(column_names)          # header row
    w.writerows(parsed_seqs.items())  # one (id, sequence) pair per row

Output:

Protein ID,Protein Sequence
NavAb:/1126,TNIVESS
Shaker:/1656,SSQAARVV
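
If you really do need the layout from the question (an id on one line and its sequence on the next), a minimal sketch is to write each of them as its own single-field row. The helper name write_alternating is made up for this illustration, and the in-memory buffer stands in for the real output file.

```python
import csv
import io

def write_alternating(parsed_seqs, f):
    """Write each id on one line and its sequence on the following line."""
    w = csv.writer(f)
    for seq_id, seq in parsed_seqs.items():
        w.writerow([seq_id])  # single-field row: the protein id
        w.writerow([seq])     # single-field row: the protein sequence

parsed_seqs = {
    'NavAb:/1126': 'TNIVESS',
    'Shaker:/1656': 'SSQAARVV',
}

# In-memory stand-in for open('sequences.csv', 'w', newline='')
buf = io.StringIO()
write_alternating(parsed_seqs, buf)
alternating_output = buf.getvalue()
print(alternating_output)
```

To write to disk instead, pass a file opened with open('sequences.csv', 'w', newline='') in place of the buffer. Note that a file in this shape has no real columns, so whether a dataset loader accepts it depends on the loader.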

As an aside, the csv.DictWriter class works well when you have a list of dictionaries, where each dictionary maps column names to values, like {"column1": "value1", "column2": "value2"}. For example:

import csv

parsed_seqs = [
  {"ID": "NavAb", "Seq": "TNIVESS"},
  {"ID": "Shaker", "Seq": "SSQAARVV"}
]
# The output is CSV, so it gets a .csv name rather than a FASTA extension.
with open("sequences.csv", "wt", newline="") as fd:
  wrtr = csv.DictWriter(fd, ["ID", "Seq"])
  wrtr.writeheader()
  wrtr.writerows(parsed_seqs)
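
To use DictWriter with the dictionary from the question, the {id: sequence} mapping first has to be turned into that list-of-dictionaries form. The following is a small sketch of one way to do it; the column names "ID" and "Seq" are taken from the example above, and the in-memory buffer is only a stand-in for the output file.

```python
import csv
import io

parsed_seqs = {
    'NavAb:/1126': 'TNIVESS',
    'Shaker:/1656': 'SSQAARVV',
}

# One dictionary per CSV row, keyed by column name.
rows = [{"ID": seq_id, "Seq": seq} for seq_id, seq in parsed_seqs.items()]

# In-memory stand-in for open('sequences.csv', 'w', newline='')
buf = io.StringIO()
wrtr = csv.DictWriter(buf, fieldnames=["ID", "Seq"])
wrtr.writeheader()
wrtr.writerows(rows)
dictwriter_output = buf.getvalue()
print(dictwriter_output)
```

This produces the same two-column layout as the csv.writer version, so for a plain {id: sequence} dictionary the csv.writer approach is the shorter of the two.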