How to dynamically join two CSV files?-CodePudding

I have two csv files and I was thinking about combining them via python - to practice my skill, and it turns out much more difficult than I ever imagined...

A simple conclusion of my problem: I feel like my code should be correct but the edited csv file turns out not to be what I thought.

One file, which I named as chrM_location.csv is the file that I want to edit.

The first file looks like this

The other file, named chrM_genes.csv is the file that I take reference at.

The second file looks like this:

The second file

There are a few other columns but I'm not using them at the moment. The first few roles are subject "CDS", then there is a blank row, followed by a few other roles with subject "exon", then another blank row, followed by some rows "genes" (and a few others).

What I tried to do is, I want to read first file row by row, focus on the number in the second column (42 for row 1 without header), see if it belongs to the range of 4-5 columns in file two (also read row by row), then if it is, I record the information of that corresponding row, and paste it back to the first file, at the end of the row, if not, I skip it.

below is my code, where I set out to run everything through the CDS section first, so I wrote a function refcds(). It returns me with:

whether or not the value is in range;
if in range, it forms a list of the information I want to paste to the second file.

Everything works fine for the main part of the code, I have the list final[] containing all information of that row, supposedly I only need to past it on that row and overwrite everything before. I used print(final) to check the info and it seems like just what I want it to be.

but this is what the result looks like: result

I have no idea why a new row is inserted and why some rows are pasted here together, when column 2 is supposedly small -> large according to value.

similar things happened in other places as well.

Thank you so much for your help! I'm running out of solution... No error messages are given and I couldn't really figure out what went wrong.

import csv
from csv import reader
from csv import writer
mylist=[]
a=0
final=[]
def refcds(value):
    mylist=[]
    with open("chrM_genes.csv", "r") as infile:
        r = csv.reader(infile)
        for rows in r:
            for i in range(0,12):
                if value >= rows[3] and value <= rows[4]:
                    mylist = ["CDS",rows[3],rows[4],int(int(value)-int(rows[3]) 1)]
                    return 0, mylist
                else:
                    return 1,[]
                
with open('chrM_location.csv','r ') as myfile:
    csv_reader = csv.reader(myfile)
    csv_writer = csv.writer(myfile)
    for row in csv_reader:
        if (row[1]) != 'POS':
            final=[]
            a,mylist = refcds(row[1]) 
            if a==0:
                lista=[row[0],row[1],row[2],row[3],row[4],row[5]]
                final.extend(lista)
                final.extend(mylist),
                csv_writer.writerow(final)
            if a==1:
                pass
        if (row[1]) == 'END':
            break
    myfile.close()```

CodePudding user response：

If I understand correctly - your code is trying to read and write to the same file at the same time.

csv_reader = csv.reader(myfile)
csv_writer = csv.writer(myfile)

I haven't tried your code: but I'm pretty sure this is going to cause weird stuff to happen... (If you refactor and output to a third file - do you still see the same issue?)

CodePudding user response：

I think the problem is that you have your reader and writer set to the same file—I have no idea what that does. A much cleaner solution is to accumulate your modified rows in the read loop, then once you're out of the read loop (and have closed the file), open the same file for writing (not appending) and write your accumulated rows.

I've made the one big change that fixes the problem.

You also said you were trying to improve your Python, so I made some other changes that are more pythonic.

import csv

# Return a matched list, or return None
def refcds(value):
    with open('chrM_genes.csv', 'r', newline='') as infile:
        reader = csv.reader(infile)
        for row in reader:
            if value >= row[3] and value <= row[4]:
                computed = int(value)-int(row[3]) 1  # probably negative??
                mylist = ['CDS', row[3], row[4], computed]
                return mylist

    return None  # if we get to this return, we've evaluated every row and didn't already return (because of a match)

# Accumulate rows here
final_rows = []
with open('chrM_location.csv', 'r', newline='') as myfile:
    reader = csv.reader(myfile)
    # next(reader)  ## if you know your file has a header

    for row in reader:
        # Show unusual conditions first...
        if row[1] == 'POS':
            continue  # skip header??

        if row[1] == 'END':
            break
        
        # ...and if not met, do desired work
        mylist = refcds(row[1])
        if mylist is not None:
            # no need to declare an empty list and then extend it
            # just create it with initial items...
            final = row[0:6]  # use slice notation to get a subset of a list (6 non-inclusive, so only to 5th col)
            final.extend(mylist)
            final_rows.append(final)


# Write accumulated rows here
with open('final.csv', 'w', newline='') as finalfile:
    writer = csv.writer(finalfile)
    writer.writerows(final_rows)

I also tried to figure out the whole thing, and came up with the following...

I think you want to look up rows of chrM_genes by Subject and compare a POS (from chrM_locaction) against Start and End bound for each gene, if POS is within the range of Start and End, return the chrM_gene data and fill in some empty cells already in chrM_location.

My first step would be to create a data structure from chrM_genes, since we'll be reading from that over and over again. Reading a bit into your problem, I can see the need to "filter" the results by subject ('CDS','exon', etc...), but I'm not sure of this. Still, I'm going to index this data structure by subject:

import csv

from collections import defaultdict

# This will create a dictionary, where subject will be the key
# and the value will be a list (of chrM (gene) rows)
chrM_rows_by_subject = defaultdict(list)

# Fill the data structure
with open('chrM_genes.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # read (skip) header

    subject_col = 2
    for row in reader:
        # you mentioned empty rows, that divide subjects, so skip empty rows
        if row == []:
            continue

        subject = row[subject_col]
        chrM_rows_by_subject[subject].append(row)

I mocked up chrM_genes.csv (and added a header, to try and clarify the structure):

Col1,Col2,Subject,Lower,Upper
chrM,ENSEMBL,CDS,3307,4262
chrM,ENSEMBL,CDS,4470,5511
chrM,ENSEMBL,CDS,5904,7445
chrM,ENSEMBL,CDS,7586,8266

chrM,ENSEMBL,exon,100,200
chrM,ENSEMBL,exon,300,400
chrM,ENSEMBL,exon,700,750

Just printing the data structure to get an idea of what it's doing:

import pprint

pprint.pprint(chrM_rows_by_subject)

yields:

defaultdict(<class 'list'>,
            {'CDS': [['chrM', 'ENSEMBL', 'CDS', '3307', '4262'],
                     ['chrM', 'ENSEMBL', 'CDS', '4470', '5511'],
                     ...
                    ],
             'exon': [['chrM', 'ENSEMBL', 'exon', '100', '200'],
                      ['chrM', 'ENSEMBL', 'exon', '300', '400'],
                      ...
                     ],
            })

Next, I want a function to match a row by subject and POS:

# Return a row that matches `subject` with `pos` between Start and End; or return None.
def match_gene_row(subject, pos):
    rows = chrM_rows_by_subject[subject]

    pos = int(pos)
    start_col = 3
    end_col   = 4
    for row in rows:
        start = row[start_col])
        end   = row[end_col])

        if pos >= start and pos <= end:
            # return just the data we want...
            return row

    # or return nothing at all
    return None

If I run these commands to test:

print(match_gene_row('CDS', '42'))
print(match_gene_row('CDS', '4200'))
print(match_gene_row('CDS', '7586'))
print(match_gene_row('exon', '500'))
print(match_gene_row('exon', '399'))

I get :

['chrM', 'ENSEMBL', 'CDS', '3307', '4262']
['chrM', 'ENSEMBL', 'CDS', '3307', '4262']
['chrM', 'ENSEMBL', 'CDS', '7586', '8266']
None                                        # exon: 500
['chrM', 'ENSEMBL', 'exon', '300', '400']

Read chrM_location.csv, and build a list of rows with matching gene data.

final_rows = []  # accumulate all rows here, for writing later

with open('chrM_location.csv', newline='') as f:
    reader = csv.reader(f)

    # Modify header
    header = next(reader)
    header.extend(['CDS','Start','End','cc'])
    final_rows.append(header)

    # Read rows and match to genes
    pos_column = 1
    for row in reader:
        pos = row[pos_column]
        matched_row = match_gene_row('CDS', pos)  # hard-coded to CDS

        if matched_row is not None:
            subj, start, end = matched_row[2:5]
            computed = str(int(pos)-int(start) 1) # this is coming out negative??
            row.extend([subj, start, end, computed])
            final_rows.append(row)

Finally, write.

with open('final.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(final_rows)

I mocked up chrM_location.csv:

name,POS,id,Ref,ALT,Frequency
chrM,41,.,C,T,0.002498
chrM,42,rs377245343,T,TC,0.001562
chrM,55,.,TA,T,0.00406
chrM,55,.,T,C,0.001874

When I run the whole thing, I get a final.csv that looks likes this:

name	POS	id	Ref	ALT	Frequency	CDS	Start	End	sequence_cc
chrM	41	.	C	T	0.002498	CDS	3307	4262	-3265
chrM	42	rs377245343	T	TC	0.001562	CDS	3307	4262	-3264
chrM	55	.	TA	T	0.00406	CDS	4470	5511	-4414
chrM	55	.	T	C	0.001874	CDS	4470	5511	-4414

I put this all together in a Gist.