Reading two columns in csv and linking by row-CodePudding

I have multiple files in a directory with the format gene_primer_otherinfo.fastq.gz. I have a csv file with the following format which states the gene and primer combinations I wish to keep. It has 64 rows in total.

|  Gene    | Reverse_primer |
| -------- | -------------- |
| Gene1    | R1.1           |
| Gene1    | R2.1           |
| Gene1    | R3.1           |
| Gene1    | R4.1           |
| Gene2    | R1.2           |
| Gene2    | R2.2           |

I would like to move the files which have gene/primer combinations not listed in the csv file e.g. Gene1-R2.2 into another folder, keeping only the ones that have the correct gene-primer combinations. For now I just want the code to identify valid and invalid combinations. However my code is accepting every combination as valid as it is not linking columns by row in the csv file, I believe it is looking in the columns and accepting as long as the gene and primer are present. I've checked the splits and they work fine.

Here is my code. I'm very new to python so any help would be much appreciated.

EDIT: example filenames which I want to keep:

gene1-R1.1_ID_R1.fastq.gz

gene1-R2.1_ID_R1.fastq.gz

gene2-R1.2_ID_R1.fastq.gz

Filenames I want to remove:

gene1-R1.2_ID_R1.fastq.gz;

gene1-R2.2_ID_R1.fastq.gz;

gene2-R1.1_ID_R1.fastq.gz;

The script is saved and run from the directory above the files (I have edited the code to include the line which states which folder search). When I run the code, it prints the information I ask for (file, gene, primer) and then every file name is listed as 'okay'

import pandas as pd
import os, re
import numpy as np

folder = "sandbox6"
os.listdir(folder)

#read csv with primer combinations
bc = pd.read_csv("primer_combinations.csv")
exceptions = []
for root, dirs, files in os.walk(folder):
    for file in files:
        try:
            split_fname = re.split("(\w )-(\w \.\d)(_\w _)(R\d).fastq.gz",file)
            gene,primer = split_fname[1], split_fname[2]
            #print(file,gene,primer)
            if [(gene == (bc['Gene'])) & (primer == (bc['Reverse_primer']))]:
                print(file,gene,primer,'okay')
            else:
                print(file,gene,primer,'invalid combination')
                continue
        except:
            exceptions  = [file]
            continue
exceptions

CodePudding user response：

Assuming the fact that your files start with g and your .csv values start with G (e.g. gene1 vs Gene1) is just a typo, the problem is really with this one line:

            if [(gene == (bc['Gene'])) & (primer == (bc['Reverse_primer']))]:

That line has a lot of decoration that should not be there, and it will always be True, since the result of that expression is always going to be a non-empty list, due to the outer [].

However, even if that wasn't the case, gene == (bc['Gene']) will never be True, as gene won't be equal that series (you want to check if it is in the series, not equal to it). And & doesn't do what you think it does, here you're going for boolean logic, so you'd want and instead.

So might try something like this instead:

import re
from os import walk
from pandas import DataFrame, read_csv

folder = "sandbox6"

bc = read_csv("primer_combinations.csv")
exceptions = []
for root, dirs, files in walk(folder):
    for file in files:
        try:
            split_fname = re.split("(\w )-(\w \.\d)(_\w _)(R\d).fastq.gz",file)
            gene,primer = split_fname[1], split_fname[2]
            if gene in bc['Gene'] and primer in bc['Reverse_primer']:
                print(file, gene, primer, 'okay')
            else:
                print(file, gene, primer, 'invalid combination')
                continue
        except Exception as e:
            print(e)
            exceptions  = [file]
            continue
print(exceptions)

However, that also doesn't work, since gene in bc['Gene'] doesn't work as you might expect, that just checks if the value of gene is an index in the series, not in its values. And it would fail because it no longer pairs up the values, it just tries to check if the value is present in either column (any combination).

So instead, you could do this:

import re
from os import walk
from pandas import DataFrame, read_csv

folder = "sandbox6"

bc = read_csv("primer_combinations.csv")
exceptions = []
combinations = set(zip((bc['Gene']), set(bc['Reverse_primer'])))
for root, dirs, files in walk(folder):
    for file in files:
        try:
            split_fname = re.split("(\w )-(\w \.\d)(_\w _)(R\d).fastq.gz", file)
            gene, primer = split_fname[1], split_fname[2]
            if (gene, primer) in combinations:
                print(file, gene, primer, 'okay')
            else:
                print(file, gene, primer, 'invalid combination')
                continue
        except Exception as e:
            print(e)
            exceptions  = [file]
            continue
print(exceptions)

This works. However, there's ways to rewrite your code that makes more efficient use of what pandas has to offer, if you need to process very large collections of files (or need to do this very often). But that would basically be a complete rewrite of your code, and go well beyond the scope of the question.

CodePudding user response：

Pandas is not a good fit for your need: it's meant for manipulating columns ("series" of data), not iterating rows. Python's CSV module is all you need for reading the CSV file.

After that, you need to figure out a good way to read the CSV (hopefully once) and repeatedly check the data for each file you're testing. A simple dict with each Gene-Primer as a key will give you that really easy and fast lookup.

I mocked up your CSV as ref.csv:

Gene,Reverse_primer
Gene1,R1.1
Gene1,R2.1
Gene1,R3.1
Gene1,R4.1
Gene2,R1.2
Gene2,R2.2

and when I run this code:

import csv
import pprint

file_ref = {}
with open("ref.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        ref_name = row[0].lower()   "-"   row[1]

        file_ref[ref_name] = None

pprint.pprint(file_ref, sort_dicts=False)

I get:

{'gene1-R1.1': None,
 'gene1-R2.1': None,
 'gene1-R3.1': None,
 'gene1-R4.1': None,
 'gene2-R1.2': None,
 'gene2-R2.2': None}

Each row is just a Python list of strings. In this case each (row) list only has two items: the first is the first column, Gene; the second item is Reverse_primer. So row[0].lower() "-" row[1] lowercases Gene and joins it (with a hyphen) with Reverse_primer. And I picked this format so that it's already in the form of the filenames, to make for less work during the parsing/lookup.

Now all you need is a function to break apart a filename and then check that ref dict to see if the filename matches:

def filename_in_ref(fname: str) -> bool:
    ref_part = fname.split("_")[0]
    # Use Python's "if key in dict" syntax for a very simple and quick check
    if ref_part in file_ref:
        return True
    return False

And here's how I tested this worked, according to your guidelines for files to keep or remove:

to_keep = [
    "gene1-R1.1_ID_R1.fastq.gz",
    "gene1-R2.1_ID_R1.fastq.gz",
    "gene2-R1.2_ID_R1.fastq.gz",
]

for fname in to_keep:
    assert filename_in_ref(fname) == True

to_remove = [
    "gene1-R1.2_ID_R1.fastq.gz",
    "gene1-R2.2_ID_R1.fastq.gz",
    "gene2-R1.1_ID_R1.fastq.gz",
]

for fname in to_remove:
    assert filename_in_ref(fname) == False

and those asserts pass.