So I have a Python dictionary that maps protein IDs to their sequences. I want to convert that dictionary to a CSV file so I can upload it as a dataset to fine-tune a transformer. However, when I create the CSV, it comes out in the dictionary's shape (key-value, key-value...).
What I want is for the CSV to have one key on a line and its value on the next line, repeating in that pattern. Is there a way to add a \n or something similar so that each key and each value gets its own line?
Shape of the dictionary:
{'NavAb:/1126': 'TNIVESSFFTKFIIYLIVLNGITMGLETSKTFMQSFGVYTTLFNQIVITIFTIEIILRIYVHRISFFKDPWSLFDFFVVAISLVPTSSGFEILRVLRVLRLFRLVTAVPQMRKI', 'Shaker:/1656': 'SSQAARVVAIISVFVILLSIVIFCLETLEDEVPDITDPFFLIETLCIIWFTFELTVRFLACPLNFCRDVMNVIDIIAIIPYFITTLNLLRVIRLVRVFRIFKLSRHSKGLQIL', .....
What I want in the CSV:
protein id
protein sequence
protein id
protein sequence
.....
The code I have for the moment:
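import csv

def parse_file(input_file):
    """Parse FASTA lines into a dict mapping sequence ID to sequence."""
    parsed_seqs = {}
    curr_seq_id = None
    curr_seq = []
    for line in input_file:  # iterate over the argument, not the global file
        line = line.strip()
        line = line.replace('-', '')  # remove '-' characters
        if line.startswith(">"):
            # A new header: store the sequence collected so far, if any
            if curr_seq_id is not None:
                parsed_seqs[curr_seq_id] = ''.join(curr_seq)
            curr_seq_id = line[1:]
            curr_seq = []
            continue
        curr_seq.append(line)
    # Store the last sequence in the file
    if curr_seq_id is not None:
        parsed_seqs[curr_seq_id] = ''.join(curr_seq)
    return parsed_seqs

with open("/content/drive/MyDrive/Colab Notebooks/seqs.fasta") as newfile:
    parsed_seqs = parse_file(newfile)

# This writes every ID as a column header and every sequence in a single row
with open('sequences.csv', 'w', newline='') as f:
    w = csv.DictWriter(f, parsed_seqs.keys())
    w.writeheader()
    w.writerow(parsed_seqs)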
CodePudding user response:
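To get CSV output with two columns, one for the protein ID and one for the protein sequence, you can write the header row yourself and then write each (key, value) pair of the dictionary as one row:
import csv

parsed_seqs = {
    'NavAb:/1126': 'TNIVESS',
    'Shaker:/1656': 'SSQAARVV'
}
column_names = ["Protein ID", "Protein Sequence"]
with open('sequences.csv', 'w', newline='') as f:
    w = csv.writer(f)  # the second argument of csv.writer is a dialect, not column names
    w.writerow(column_names)           # header row
    w.writerows(parsed_seqs.items())   # one (ID, sequence) row per entry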
Output:
Protein ID,Protein Sequence
NavAb:/1126,TNIVESS
Shaker:/1656,SSQAARVV
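If the alternating layout from the question (ID on one line, sequence on the next) is really what is needed, a minimal sketch could look like the one below. The helper name `write_alternating` and the file name `sequences.txt` are assumptions made for illustration. Note that the result is a plain one-column text file rather than a two-column CSV, so whether it suits the fine-tuning pipeline depends on what that pipeline expects.

```python
import io

def write_alternating(parsed_seqs, handle):
    """Write each ID on one line and its sequence on the following line."""
    for seq_id, seq in parsed_seqs.items():
        handle.write(f"{seq_id}\n{seq}\n")

parsed_seqs = {
    'NavAb:/1126': 'TNIVESS',
    'Shaker:/1656': 'SSQAARVV',
}

# An in-memory buffer is used here for demonstration; with a real file,
# pass open('sequences.txt', 'w') instead (hypothetical file name).
buffer = io.StringIO()
write_alternating(parsed_seqs, buffer)
print(buffer.getvalue(), end="")
```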
As an aside, the csv.DictWriter class works well when you have a list of dictionaries, where each dictionary is structured like {"column1": "value1", "column2": "value2"}. For example:
import csv

parsed_seqs = [
    {"ID": "NavAb", "Seq": "TNIVESS"},
    {"ID": "Shaker", "Seq": "SSQAARVV"}
]
with open("sequences.csv", "wt", newline="") as fd:
    wrtr = csv.DictWriter(fd, ["ID", "Seq"])  # field names become the header
    wrtr.writeheader()
    wrtr.writerows(parsed_seqs)               # one row per dictionary
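To use csv.DictWriter with the flat ID-to-sequence dictionary from the question, the dictionary can first be reshaped into a list of row dictionaries. This is only a sketch: the helper name `dict_to_rows` and the column names "ID" and "Seq" are illustrative choices, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def dict_to_rows(parsed_seqs):
    """Turn {id: sequence} into a list of {"ID": id, "Seq": sequence} rows."""
    return [{"ID": seq_id, "Seq": seq} for seq_id, seq in parsed_seqs.items()]

parsed_seqs = {
    'NavAb:/1126': 'TNIVESS',
    'Shaker:/1656': 'SSQAARVV',
}

rows = dict_to_rows(parsed_seqs)

# With a real file, replace the buffer with
# open("sequences.csv", "w", newline="").
buffer = io.StringIO()
wrtr = csv.DictWriter(buffer, fieldnames=["ID", "Seq"])
wrtr.writeheader()
wrtr.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

For a simple two-column table like this one, csv.writer with `parsed_seqs.items()` is shorter; the DictWriter form may be more convenient if further columns are added per protein later.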