Can a snakemake rule depend on data in the file instead of its change state?

I have data in a CSV file that changes frequently. The CSV file is an input for a Snakefile rule. My issue is that I want this rule to run only when a certain value appears in the CSV data, not every time the file changes. Is it possible to make rule execution depend on specific patterns in the changed file rather than on the fact that it changed?

CodePudding user response：

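An option could be to always run the conditional rule, but to run it in "dummy" mode if the value is not present in your CSV. A rough sketch:

rule run_if:
    input:
        'data.csv',
    output:
        'done.out',
    run:
        # Check whether the trigger value occurs anywhere in the CSV text
        with open(input[0]) as f:
            has_value = 'some_value' in f.read()
        if has_value:
            # do stuff and write 'done.out' (placeholder content here)
            with open(output[0], 'w') as out:
                out.write('done\n')
        else:
            # Dummy mode: only create/update the file
            shell('touch {output}')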

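Whether this is good enough depends on what happens after done.out is created or updated, though: any downstream rule that takes done.out as input will see a newer timestamp and be re-run, even in "dummy" mode.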

In general, it is better to think of snakemake in terms of a chain of input/output files rather than in terms of which rules you want to run.
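
If a plain substring test on the whole file is too loose, the membership check in the rule above could be made column-aware with a small helper. This is only a sketch: the function name csv_has_value, the column name "status", and the sample rows are made up for illustration, and it assumes the CSV has a header row.

```python
import csv
import tempfile
from pathlib import Path

def csv_has_value(path, column, value):
    """Return True if any row of the CSV has `value` in `column`.

    Values are compared as strings, because csv.DictReader does not
    convert types.
    """
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row.get(column) == value:
                return True
    return False

# Small self-contained demo on a throwaway file (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "data.csv"
    demo.write_text("status,sample\npending,s1\nready,s2\n")
    found = csv_has_value(demo, "status", "ready")
    missing = csv_has_value(demo, "status", "failed")
```

Inside the run block you would then call something like csv_has_value(input[0], 'status', 'ready') instead of searching the raw text.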

CodePudding user response：

The check that Snakemake does to determine if a rule should be re-executed is based on timestamps (not file content), so the first thing to do is to wrap the relevant input files in ancient(), which tells Snakemake to ignore their timestamps.

Next, since the Snakefile is a Python file, it's possible to incorporate the required logic using pandas or some other library for handling CSV files. Below is a rough idea:

import pandas as pd
csv_file = 'some_file.txt'
df = pd.read_csv(csv_file)
# keep column_y for the rows that should trigger a job
items_to_do = df.query('column_x>=10')['column_y'].tolist()

rule all:
    input: expand('file_out_{y}.txt', y=items_to_do)

rule some_rule:
    # ancient() makes Snakemake ignore the timestamp of the CSV
    input: ancient(csv_file)
    output: 'file_out_{y}.txt'
    ... # code to generate the file

So if you update some_file.txt, but the rows you changed all have column_x below 10, then the list of requested files stays the same and no new jobs will be executed.
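
To make that concrete, here is a small sketch of the same filter written with the standard csv module instead of pandas, so it runs on its own. The sample rows are invented; only column_x, column_y, and the threshold of 10 come from the snippet above.

```python
import csv
import io

def items_to_do(csv_text, threshold=10):
    """Return column_y for every row whose column_x is at least `threshold`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["column_y"] for row in reader if float(row["column_x"]) >= threshold]

# Made-up contents of the CSV at three points in time
before = "column_x,column_y\n12,a\n3,b\n15,c\n"
after_small_edit = "column_x,column_y\n12,a\n7,b\n15,c\n"  # 3 -> 7, still below 10
after_big_edit = "column_x,column_y\n12,a\n11,b\n15,c\n"   # 3 -> 11, now qualifies

# The target lists that `rule all` would request in each case
targets_before = [f"file_out_{y}.txt" for y in items_to_do(before)]
targets_small = [f"file_out_{y}.txt" for y in items_to_do(after_small_edit)]
targets_big = [f"file_out_{y}.txt" for y in items_to_do(after_big_edit)]
```

The first two target lists are identical, so nothing new is requested after the small edit. Only the third adds file_out_b.txt, which is the one job that would then run.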

Update: I assumed that the rule in question generates multiple files using wildcards, but re-reading the question this doesn't seem to be the case. If it's just a single rule, then the snippet above can be modified to work along these lines:

import pandas as pd
csv_file = 'some_file.txt'

def file_is_updated():
    df = pd.read_csv(csv_file)
    # implement logic to decide if the rule should be re-run
    # e.g. re-run only if the file has more than 50 rows
    return len(df) > 50

# use Python to define the rule only when the condition holds
if file_is_updated():
    rule some_rule:
        input: csv_file
        ...
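
One thing to watch with the conditional definition: when the rule is not defined, nothing can produce its output, so as far as I know a target rule that still requests that file will fail if the file does not exist yet. A possible way around it is to build the target list from the same check. The sketch below is an assumption-laden illustration, not part of the answer above: file_matches, targets_for, the output name done.out, and the regular expressions are all hypothetical, and it tests for a pattern in the text rather than a row count.

```python
import re
import tempfile
from pathlib import Path

def file_matches(path, pattern):
    """Return True if any line of the file matches the regular expression."""
    regex = re.compile(pattern)
    with open(path) as handle:
        return any(regex.search(line) for line in handle)

def targets_for(path, pattern, output="done.out"):
    """Request the output only when the pattern is present, else request nothing."""
    return [output] if file_matches(path, pattern) else []

# Small self-contained demo on a throwaway file (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "some_file.txt"
    demo.write_text("id,state\n1,ok\n2,ERROR_42\n")
    with_match = targets_for(demo, r"ERROR_\d+")
    without_match = targets_for(demo, r"FATAL")
```

In the Snakefile, the input of rule all could then be set to the list returned by targets_for(csv_file, ...), with the same file_matches call guarding the rule definition.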