I have two csv files and I was thinking about combining them via python - to practice my skill, and it turns out much more difficult than I ever imagined...
A simple conclusion of my problem: I feel like my code should be correct but the edited csv file turns out not to be what I thought.
One file, which I named as chrM_location.csv
is the file that I want to edit.
The first file looks like this
The other file, named chrM_genes.csv
is the file that I take reference at.
The second file looks like this:
There are a few other columns but I'm not using them at the moment. The first few roles are subject "CDS", then there is a blank row, followed by a few other roles with subject "exon", then another blank row, followed by some rows "genes" (and a few others).
What I tried to do is, I want to read first file row by row, focus on the number in the second column (42 for row 1 without header), see if it belongs to the range of 4-5 columns in file two (also read row by row), then if it is, I record the information of that corresponding row, and paste it back to the first file, at the end of the row, if not, I skip it.
below is my code, where I set out to run everything through the CDS section first, so I wrote a function refcds()
. It returns me with:
- whether or not the value is in range;
- if in range, it forms a list of the information I want to paste to the second file.
Everything works fine for the main part of the code, I have the list final[]
containing all information of that row, supposedly I only need to past it on that row and overwrite everything before. I used print(final)
to check the info and it seems like just what I want it to be.
but this is what the result looks like:
I have no idea why a new row is inserted and why some rows are pasted here together, when column 2 is supposedly small -> large according to value.
similar things happened in other places as well.
Thank you so much for your help! I'm running out of solution... No error messages are given and I couldn't really figure out what went wrong.
import csv
from csv import reader
from csv import writer
mylist=[]
a=0
final=[]
def refcds(value):
mylist=[]
with open("chrM_genes.csv", "r") as infile:
r = csv.reader(infile)
for rows in r:
for i in range(0,12):
if value >= rows[3] and value <= rows[4]:
mylist = ["CDS",rows[3],rows[4],int(int(value)-int(rows[3]) 1)]
return 0, mylist
else:
return 1,[]
with open('chrM_location.csv','r ') as myfile:
csv_reader = csv.reader(myfile)
csv_writer = csv.writer(myfile)
for row in csv_reader:
if (row[1]) != 'POS':
final=[]
a,mylist = refcds(row[1])
if a==0:
lista=[row[0],row[1],row[2],row[3],row[4],row[5]]
final.extend(lista)
final.extend(mylist),
csv_writer.writerow(final)
if a==1:
pass
if (row[1]) == 'END':
break
myfile.close()```
CodePudding user response:
If I understand correctly - your code is trying to read and write to the same file at the same time.
csv_reader = csv.reader(myfile)
csv_writer = csv.writer(myfile)
I haven't tried your code: but I'm pretty sure this is going to cause weird stuff to happen... (If you refactor and output to a third file - do you still see the same issue?)
CodePudding user response:
I think the problem is that you have your reader and writer set to the same file—I have no idea what that does. A much cleaner solution is to accumulate your modified rows in the read loop, then once you're out of the read loop (and have closed the file), open the same file for writing (not appending) and write your accumulated rows.
I've made the one big change that fixes the problem.
You also said you were trying to improve your Python, so I made some other changes that are more pythonic.
import csv
# Return a matched list, or return None
def refcds(value):
with open('chrM_genes.csv', 'r', newline='') as infile:
reader = csv.reader(infile)
for row in reader:
if value >= row[3] and value <= row[4]:
computed = int(value)-int(row[3]) 1 # probably negative??
mylist = ['CDS', row[3], row[4], computed]
return mylist
return None # if we get to this return, we've evaluated every row and didn't already return (because of a match)
# Accumulate rows here
final_rows = []
with open('chrM_location.csv', 'r', newline='') as myfile:
reader = csv.reader(myfile)
# next(reader) ## if you know your file has a header
for row in reader:
# Show unusual conditions first...
if row[1] == 'POS':
continue # skip header??
if row[1] == 'END':
break
# ...and if not met, do desired work
mylist = refcds(row[1])
if mylist is not None:
# no need to declare an empty list and then extend it
# just create it with initial items...
final = row[0:6] # use slice notation to get a subset of a list (6 non-inclusive, so only to 5th col)
final.extend(mylist)
final_rows.append(final)
# Write accumulated rows here
with open('final.csv', 'w', newline='') as finalfile:
writer = csv.writer(finalfile)
writer.writerows(final_rows)
I also tried to figure out the whole thing, and came up with the following...
I think you want to look up rows of chrM_genes by Subject and compare a POS (from chrM_locaction) against Start and End bound for each gene, if POS is within the range of Start and End, return the chrM_gene data and fill in some empty cells already in chrM_location.
My first step would be to create a data structure from chrM_genes, since we'll be reading from that over and over again. Reading a bit into your problem, I can see the need to "filter" the results by subject ('CDS','exon', etc...), but I'm not sure of this. Still, I'm going to index this data structure by subject:
import csv
from collections import defaultdict
# This will create a dictionary, where subject will be the key
# and the value will be a list (of chrM (gene) rows)
chrM_rows_by_subject = defaultdict(list)
# Fill the data structure
with open('chrM_genes.csv', newline='') as f:
reader = csv.reader(f)
next(reader) # read (skip) header
subject_col = 2
for row in reader:
# you mentioned empty rows, that divide subjects, so skip empty rows
if row == []:
continue
subject = row[subject_col]
chrM_rows_by_subject[subject].append(row)
I mocked up chrM_genes.csv (and added a header, to try and clarify the structure):
Col1,Col2,Subject,Lower,Upper
chrM,ENSEMBL,CDS,3307,4262
chrM,ENSEMBL,CDS,4470,5511
chrM,ENSEMBL,CDS,5904,7445
chrM,ENSEMBL,CDS,7586,8266
chrM,ENSEMBL,exon,100,200
chrM,ENSEMBL,exon,300,400
chrM,ENSEMBL,exon,700,750
Just printing the data structure to get an idea of what it's doing:
import pprint
pprint.pprint(chrM_rows_by_subject)
yields:
defaultdict(<class 'list'>,
{'CDS': [['chrM', 'ENSEMBL', 'CDS', '3307', '4262'],
['chrM', 'ENSEMBL', 'CDS', '4470', '5511'],
...
],
'exon': [['chrM', 'ENSEMBL', 'exon', '100', '200'],
['chrM', 'ENSEMBL', 'exon', '300', '400'],
...
],
})
Next, I want a function to match a row by subject and POS:
# Return a row that matches `subject` with `pos` between Start and End; or return None.
def match_gene_row(subject, pos):
rows = chrM_rows_by_subject[subject]
pos = int(pos)
start_col = 3
end_col = 4
for row in rows:
start = row[start_col])
end = row[end_col])
if pos >= start and pos <= end:
# return just the data we want...
return row
# or return nothing at all
return None
If I run these commands to test:
print(match_gene_row('CDS', '42'))
print(match_gene_row('CDS', '4200'))
print(match_gene_row('CDS', '7586'))
print(match_gene_row('exon', '500'))
print(match_gene_row('exon', '399'))
I get :
['chrM', 'ENSEMBL', 'CDS', '3307', '4262']
['chrM', 'ENSEMBL', 'CDS', '3307', '4262']
['chrM', 'ENSEMBL', 'CDS', '7586', '8266']
None # exon: 500
['chrM', 'ENSEMBL', 'exon', '300', '400']
Read chrM_location.csv, and build a list of rows with matching gene data.
final_rows = [] # accumulate all rows here, for writing later
with open('chrM_location.csv', newline='') as f:
reader = csv.reader(f)
# Modify header
header = next(reader)
header.extend(['CDS','Start','End','cc'])
final_rows.append(header)
# Read rows and match to genes
pos_column = 1
for row in reader:
pos = row[pos_column]
matched_row = match_gene_row('CDS', pos) # hard-coded to CDS
if matched_row is not None:
subj, start, end = matched_row[2:5]
computed = str(int(pos)-int(start) 1) # this is coming out negative??
row.extend([subj, start, end, computed])
final_rows.append(row)
Finally, write.
with open('final.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerows(final_rows)
I mocked up chrM_location.csv:
name,POS,id,Ref,ALT,Frequency
chrM,41,.,C,T,0.002498
chrM,42,rs377245343,T,TC,0.001562
chrM,55,.,TA,T,0.00406
chrM,55,.,T,C,0.001874
When I run the whole thing, I get a final.csv that looks likes this:
name | POS | id | Ref | ALT | Frequency | CDS | Start | End | sequence_cc |
---|---|---|---|---|---|---|---|---|---|
chrM | 41 | . | C | T | 0.002498 | CDS | 3307 | 4262 | -3265 |
chrM | 42 | rs377245343 | T | TC | 0.001562 | CDS | 3307 | 4262 | -3264 |
chrM | 55 | . | TA | T | 0.00406 | CDS | 4470 | 5511 | -4414 |
chrM | 55 | . | T | C | 0.001874 | CDS | 4470 | 5511 | -4414 |
I put this all together in a Gist.