I have a series of samples whose file names each start with a unique header. For instance, a sample could be named (SC892138_CTGAAGCT-ACTCTGAG or SC892138_unisample)_L001_001.star_rg_added.sorted.dmark.bam, or (SC892155_CTGAAGCT-ACTCTGAG or SC892155_unisample)_L001_001.star_rg_added.sorted.dmark.bam, or have some other number after SC. I can parse out the SC###### value for each sample, but I want to get to this: SC892155_CTGAAGCT-ACTCTGAG. How do I parse this out?
import os
import glob
import itertools
import pandas
from collections import defaultdict
workdir: os.environ['PWD']
## ---- The parser may have to be customized for each run ---- ##
def parse_sampleID(filename):
    return filename.split('/')[-1].split('_')[0]
fastqs = glob.glob('/P/A/T/H/S/*star_rg_added.sorted.dmark.bam')
d = defaultdict(list)
# groupby only merges consecutive items, so sort by the same key first
for key, value in itertools.groupby(sorted(fastqs, key=parse_sampleID), parse_sampleID):
    d[key] = list(value)
# Need to modify sampleIDs to bams not R1/R2 files
sampleIDs = d.keys()
CodePudding user response:
Maybe you aren't sharing enough examples to illustrate the problem adequately?
Option #1 based on provided information:
paths_n_fns = [
"/P/A/T/H/S/SC892138_CTGAAGCT-ACTCTGAG_L001_001.star_rg_added.sorted.dmark.bam",
"/P/A/T/H/S/SC892138_unisample_L001_001.star_rg_added.sorted.dmark.bam",
"/P/A/T/H/S/SC892155_CTGAAGCT-ACTCTGAG_L001_001.star_rg_added.sorted.dmark.bam",
"/P/A/T/H/S/SC892155_unisample_L001_001.star_rg_added.sorted.dmark.bam"
]
def parse_sampleID(path_n_filename):
    if "unisample" in path_n_filename:
        return path_n_filename.split('/')[-1].split('_')[0]
    return path_n_filename.split('/')[-1].split('_L')[0]
sample_ids = []
for paths_n_fn in paths_n_fns:
    sample_ids.append(parse_sampleID(paths_n_fn))
sample_ids
Output from option #1
['SC892138_CTGAAGCT-ACTCTGAG',
'SC892138',
'SC892155_CTGAAGCT-ACTCTGAG',
'SC892155']
Option #2 based on provided information:
paths_n_fns = [
"/P/A/T/H/S/SC892138_CTGAAGCT-ACTCTGAG_L001_001.star_rg_added.sorted.dmark.bam",
"/P/A/T/H/S/SC892138_unisample_L001_001.star_rg_added.sorted.dmark.bam",
"/P/A/T/H/S/SC892155_CTGAAGCT-ACTCTGAG_L001_001.star_rg_added.sorted.dmark.bam",
"/P/A/T/H/S/SC892155_unisample_L001_001.star_rg_added.sorted.dmark.bam"
]
def parse_sampleID(path_n_filename):
    if "unisample" in path_n_filename:
        return path_n_filename.split('/')[-1].split('_')[0]
    parts = path_n_filename.split('/')[-1].split('_', 2)
    return "_".join([parts[0], parts[1]])
sample_ids = []
for paths_n_fn in paths_n_fns:
    sample_ids.append(parse_sampleID(paths_n_fn))
sample_ids
Output from option #2
['SC892138_CTGAAGCT-ACTCTGAG',
'SC892138',
'SC892155_CTGAAGCT-ACTCTGAG',
'SC892155']
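If the file names get more varied, a regular expression may be easier to adjust than chained splits. The following is only a sketch: it assumes every file name looks like SC<digits>_<barcode or "unisample">_L<lane>_..., and it keeps the same behaviour as the two options above (the barcode is kept, "unisample" is dropped). The names SAMPLE_RE, parse_sample_id and group_by_sample are made up for this example. It also groups the paths with a plain loop rather than itertools.groupby, so the order of the glob results does not matter.

```python
import os
import re
from collections import defaultdict

# Assumed layout: SC<digits>_<barcode or "unisample">_L<lane>_...
SAMPLE_RE = re.compile(r"^(SC\d+)_([^_]+)_L\d+_")


def parse_sample_id(path):
    """Return 'SC######_BARCODE', or just 'SC######' for unisample files."""
    name = os.path.basename(path)
    match = SAMPLE_RE.match(name)
    if match is None:
        # Fail loudly instead of returning a wrong ID for an unexpected name.
        raise ValueError(f"unexpected file name: {name}")
    sc_number, tag = match.groups()
    return sc_number if tag == "unisample" else f"{sc_number}_{tag}"


def group_by_sample(paths):
    """Map each sample ID to the list of paths that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[parse_sample_id(path)].append(path)
    return dict(groups)


paths_n_fns = [
    "/P/A/T/H/S/SC892138_CTGAAGCT-ACTCTGAG_L001_001.star_rg_added.sorted.dmark.bam",
    "/P/A/T/H/S/SC892138_unisample_L001_001.star_rg_added.sorted.dmark.bam",
    "/P/A/T/H/S/SC892155_CTGAAGCT-ACTCTGAG_L001_001.star_rg_added.sorted.dmark.bam",
    "/P/A/T/H/S/SC892155_unisample_L001_001.star_rg_added.sorted.dmark.bam",
]
sample_ids = [parse_sample_id(p) for p in paths_n_fns]
grouped = group_by_sample(paths_n_fns)
```

If you want the unisample files to keep their suffix as well (SC892138_unisample), return f"{sc_number}_{tag}" unconditionally.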