I have multiple files in a directory with the format gene_primer_otherinfo.fastq.gz. I have a csv file with the following format which states the gene and primer combinations I wish to keep. It has 64 rows in total.
| Gene | Reverse_primer |
| -------- | -------------- |
| Gene1 | R1.1 |
| Gene1 | R2.1 |
| Gene1 | R3.1 |
| Gene1 | R4.1 |
| Gene2 | R1.2 |
| Gene2 | R2.2 |
I would like to move the files which have gene/primer combinations not listed in the csv file e.g. Gene1-R2.2 into another folder, keeping only the ones that have the correct gene-primer combinations. For now I just want the code to identify valid and invalid combinations. However my code is accepting every combination as valid as it is not linking columns by row in the csv file, I believe it is looking in the columns and accepting as long as the gene and primer are present. I've checked the splits and they work fine.
Here is my code. I'm very new to python so any help would be much appreciated.
EDIT: example filenames which I want to keep:
gene1-R1.1_ID_R1.fastq.gz
gene1-R2.1_ID_R1.fastq.gz
gene2-R1.2_ID_R1.fastq.gz
Filenames I want to remove:
gene1-R1.2_ID_R1.fastq.gz;
gene1-R2.2_ID_R1.fastq.gz;
gene2-R1.1_ID_R1.fastq.gz;
The script is saved and run from the directory above the files (I have edited the code to include the line which states which folder search). When I run the code, it prints the information I ask for (file, gene, primer) and then every file name is listed as 'okay'
import pandas as pd
import os, re
import numpy as np
folder = "sandbox6"
os.listdir(folder)
#read csv with primer combinations
bc = pd.read_csv("primer_combinations.csv")
exceptions = []
for root, dirs, files in os.walk(folder):
for file in files:
try:
split_fname = re.split("(\w )-(\w \.\d)(_\w _)(R\d).fastq.gz",file)
gene,primer = split_fname[1], split_fname[2]
#print(file,gene,primer)
if [(gene == (bc['Gene'])) & (primer == (bc['Reverse_primer']))]:
print(file,gene,primer,'okay')
else:
print(file,gene,primer,'invalid combination')
continue
except:
exceptions = [file]
continue
exceptions
CodePudding user response:
Assuming the fact that your files start with g
and your .csv values start with G
(e.g. gene1
vs Gene1
) is just a typo, the problem is really with this one line:
if [(gene == (bc['Gene'])) & (primer == (bc['Reverse_primer']))]:
That line has a lot of decoration that should not be there, and it will always be True
, since the result of that expression is always going to be a non-empty list, due to the outer []
.
However, even if that wasn't the case, gene == (bc['Gene'])
will never be True
, as gene
won't be equal that series (you want to check if it is in the series, not equal to it). And &
doesn't do what you think it does, here you're going for boolean logic, so you'd want and
instead.
So might try something like this instead:
import re
from os import walk
from pandas import DataFrame, read_csv
folder = "sandbox6"
bc = read_csv("primer_combinations.csv")
exceptions = []
for root, dirs, files in walk(folder):
for file in files:
try:
split_fname = re.split("(\w )-(\w \.\d)(_\w _)(R\d).fastq.gz",file)
gene,primer = split_fname[1], split_fname[2]
if gene in bc['Gene'] and primer in bc['Reverse_primer']:
print(file, gene, primer, 'okay')
else:
print(file, gene, primer, 'invalid combination')
continue
except Exception as e:
print(e)
exceptions = [file]
continue
print(exceptions)
However, that also doesn't work, since gene in bc['Gene']
doesn't work as you might expect, that just checks if the value of gene
is an index in the series, not in its values. And it would fail because it no longer pairs up the values, it just tries to check if the value is present in either column (any combination).
So instead, you could do this:
import re
from os import walk
from pandas import DataFrame, read_csv
folder = "sandbox6"
bc = read_csv("primer_combinations.csv")
exceptions = []
combinations = set(zip((bc['Gene']), set(bc['Reverse_primer'])))
for root, dirs, files in walk(folder):
for file in files:
try:
split_fname = re.split("(\w )-(\w \.\d)(_\w _)(R\d).fastq.gz", file)
gene, primer = split_fname[1], split_fname[2]
if (gene, primer) in combinations:
print(file, gene, primer, 'okay')
else:
print(file, gene, primer, 'invalid combination')
continue
except Exception as e:
print(e)
exceptions = [file]
continue
print(exceptions)
This works. However, there's ways to rewrite your code that makes more efficient use of what pandas
has to offer, if you need to process very large collections of files (or need to do this very often). But that would basically be a complete rewrite of your code, and go well beyond the scope of the question.
CodePudding user response:
Pandas is not a good fit for your need: it's meant for manipulating columns ("series" of data), not iterating rows. Python's CSV module is all you need for reading the CSV file.
After that, you need to figure out a good way to read the CSV (hopefully once) and repeatedly check the data for each file you're testing. A simple dict with each Gene-Primer as a key will give you that really easy and fast lookup.
I mocked up your CSV as ref.csv:
Gene,Reverse_primer
Gene1,R1.1
Gene1,R2.1
Gene1,R3.1
Gene1,R4.1
Gene2,R1.2
Gene2,R2.2
and when I run this code:
import csv
import pprint
file_ref = {}
with open("ref.csv", newline="") as f:
reader = csv.reader(f)
next(reader) # discard header
for row in reader:
ref_name = row[0].lower() "-" row[1]
file_ref[ref_name] = None
pprint.pprint(file_ref, sort_dicts=False)
I get:
{'gene1-R1.1': None,
'gene1-R2.1': None,
'gene1-R3.1': None,
'gene1-R4.1': None,
'gene2-R1.2': None,
'gene2-R2.2': None}
Each row is just a Python list of strings. In this case each (row) list only has two items: the first is the first column, Gene; the second item is Reverse_primer. So row[0].lower() "-" row[1]
lowercases Gene and joins it (with a hyphen) with Reverse_primer. And I picked this format so that it's already in the form of the filenames, to make for less work during the parsing/lookup.
Now all you need is a function to break apart a filename and then check that ref dict to see if the filename matches:
def filename_in_ref(fname: str) -> bool:
ref_part = fname.split("_")[0]
# Use Python's "if key in dict" syntax for a very simple and quick check
if ref_part in file_ref:
return True
return False
And here's how I tested this worked, according to your guidelines for files to keep or remove:
to_keep = [
"gene1-R1.1_ID_R1.fastq.gz",
"gene1-R2.1_ID_R1.fastq.gz",
"gene2-R1.2_ID_R1.fastq.gz",
]
for fname in to_keep:
assert filename_in_ref(fname) == True
to_remove = [
"gene1-R1.2_ID_R1.fastq.gz",
"gene1-R2.2_ID_R1.fastq.gz",
"gene2-R1.1_ID_R1.fastq.gz",
]
for fname in to_remove:
assert filename_in_ref(fname) == False
and those asserts pass.