Array of values as input in Snakemake workflows

I have started migrating my workflows from Nextflow to Snakemake and am already hitting a wall at the start of my pipelines, which very often begin with a list of numbers (each representing a "run number" from our detector).

What I have for example is a run-list.txt like

# detector_id run_number
75 63433
75 67325
42 57584
42 57899
42 58998

which then needs to be passed line by line to a process that queries a database or data storage system and retrieves a file to the local system.

This means that e.g. 75 63433 would generate the output RUN_00000075_00063433.h5 via a rule which receives detector_id=75 and run_number=63433 as input parameters.

With Nextflow this is fairly easy: you just define a process which emits a tuple of these values.

I don't quite understand how I can do something like this in Snakemake, since it seems that inputs and outputs always need to be files (remote or local). In fact, some of the files are indeed accessible via iRODS and/or XRootD, but even then I need to start with a run selection, which is defined in a list like the run-list.txt above.

My question is now: what is the Snakemake-style approach to this problem?

A non-working pseudo-code would be:

rule:
    input:
        [line for line in open("run-list.txt").readlines()]
    output:
        "{detector_id}_{run_number}.h5"
    shell:
        "detector_id, run_number = line.split()"
        "touch "{detector_id}_{run_number}.h5""

CodePudding user response：

In Snakemake you'd use this file to generate lists of the values you want to feed into your workflow. You'd parse the detector IDs and run numbers outside the rules. Off the top of my head your run list looks like it could neatly be handled with pandas, if you want to use an external library.

import pandas as pd

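# comment="#" skips the "# detector_id run_number" header line, so the column names are given explicitly
run_list = pd.read_csv("run-list.txt", sep=" ", comment="#", header=None, names=["detector_id", "run_number"])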
detector_ids = list(run_list["detector_id"])
run_numbers = list(run_list["run_number"])

Then a rule for running what you want to do for getting one file would be, under the assumption your file names do not need to be zero-padded:

rule do_something:
    output: "{detector_id}_{run_number}.h5"
    shell: "do_something_with {wildcards.detector_id} {wildcards.run_number}"

With just this rule alone, detector_id and run_number could in theory be anything, so you'll need something to tell Snakemake to run this in a way that produces the output you want. To run this for all lines in your file, you'd therefore set up a rule that takes all potential outputs defined by the file as input.

rule run_all:
    input: expand("{detector_id}_{run_number}.h5", zip, detector_id=detector_ids, run_number=run_numbers)

with the zip part ensuring that the first detector ID goes with the first run number and so on.
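
To make the effect of zip concrete, here is a small plain-Python sketch. It only imitates what expand does with and without zip (as far as I know, expand combines every value with every other by default, much like itertools.product); it is not Snakemake's own code, and the names zipped and crossed are just for illustration.

```python
from itertools import product

# Values as they would come out of the run list in the question.
detector_ids = [75, 75, 42, 42, 42]
run_numbers = [63433, 67325, 57584, 57899, 58998]
pattern = "{detector_id}_{run_number}.h5"

# Pairwise: element i of one list goes with element i of the other.
zipped = [
    pattern.format(detector_id=d, run_number=r)
    for d, r in zip(detector_ids, run_numbers)
]

# All-against-all: every detector ID is combined with every run number.
crossed = [
    pattern.format(detector_id=d, run_number=r)
    for d, r in product(detector_ids, run_numbers)
]

print(zipped)
print(len(zipped), len(crossed))
```

With the five-line run list, zipped holds the 5 file names you actually want, while crossed holds 25 entries, including pairs such as 75_57584.h5 that never appear in the file.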

Finally, you'd run it by naming the rule you want as the target, so snakemake --cores 1 run_all (recent Snakemake versions require the --cores option).

CodePudding user response：

To make this work you need two ingredients:

a rule that specifies the logic for generating a single file (defining any file dependencies, if necessary)
a rule that defines which files should be produced; by convention this rule is called all.

Here is a rough sketch of the code:

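def process_lines(file_name):
    """Yield zero-padded (detector_id, run_number) pairs, skipping non-numeric lines."""
    with open(file_name, "r") as f:
        for line in f:
            fields = line.split()
            # skip blank or incomplete lines, which would otherwise fail to unpack
            if len(fields) < 2:
                continue
            detector_id, run_number = fields[:2]
            # the "# detector_id run_number" header fails this check and is skipped
            if detector_id.isnumeric() and run_number.isnumeric():
                yield detector_id.zfill(8), run_number.zfill(8)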


out_file_format = "{detector_id}_{run_number}.h5"
final_files = [
    out_file_format.format(detector_id=detector_id, run_number=run_number)
    for detector_id, run_number in process_lines("run-list.txt")
]


rule all:
    """Collect all outputs."""
    input:
        final_files,


rule:
    """Generate an output"""
    output:
        out_file_format,
    shell:
        # touch creates the output file; without it Snakemake reports the output as missing
        """
        echo {wildcards[detector_id]}
        echo {wildcards[run_number]}
        echo {output}
        touch {output}
        """

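One thing to keep in mind with the zero-padded version: the wildcards are now strings such as 00000075, whereas the question wants detector_id=75 and run_number=63433 passed to the retrieval step. A minimal sketch of the conversion, assuming a hypothetical helper called to_query_params (the name is made up here and is not part of Snakemake):

```python
def to_query_params(detector_id, run_number):
    """Turn zero-padded wildcard strings back into plain integers."""
    # int() accepts leading zeros in a decimal string, so "00000075" becomes 75
    return int(detector_id), int(run_number)


print(to_query_params("00000075", "00063433"))
```

Inside a Snakefile the same conversion could go into a params function, for example params: detector=lambda wildcards: int(wildcards.detector_id), which the shell command can then reference as {params.detector}.
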
CodePudding user response：

Already some good answers but since I got the code in the meantime here's my 2p. Save as Snakefile and it should be runnable.

import pandas

# In reality you read this from file using pandas.read_csv.
# Or use a solution other than pandas dataframes.
run_list = [
    (75, 63433),
    (75, 67325),
    (42, 57584),
    (42, 57899),
    (42, 58998),
]

run_list = pandas.DataFrame(run_list, columns=['detector_id', 'run_id'])

rule all:
    input:
        expand('RUN_{detector_id}_{run_id}.h5', zip, detector_id=run_list.detector_id, run_id=run_list.run_id),

rule make_run:
    output:
        'RUN_{detector_id}_{run_id}.h5',
    shell:
        r"""
        touch {output}
        """

You would need some string manipulation for the zero-padding, but that is a Python thing, not a Snakemake one.
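
For the zero-padding itself, a format specification is probably enough. The sketch below assumes the RUN_00000075_00063433.h5 naming from the question (eight digits per field); the helper name run_filename is made up for illustration.

```python
def run_filename(detector_id, run_number):
    """Build the zero-padded file name for one (detector_id, run_number) pair."""
    # int() also accepts the integer types pandas hands back;
    # 08d pads the number with leading zeros to a width of eight digits
    return f"RUN_{int(detector_id):08d}_{int(run_number):08d}.h5"


runs = [(75, 63433), (75, 67325), (42, 57584), (42, 57899), (42, 58998)]
targets = [run_filename(d, r) for d, r in runs]
print(targets[0])
```

A list like targets can be handed straight to the input of rule all, in place of the expand call.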