Home > Software design >  How to run an AWK script within the input section of snakemake (in order to retrieve a snakemake inp
How to run an AWK script within the input section of snakemake (in order to retrieve a snakemake inp

Time:05-20

I'm trying to build a snakemake pipeline for analysing RNAseq datasets. Some datasets might be (1) single end or paired end, and (2) the strandedness between datasets may vary as well. I want the snakemake pipeline to automatically detect these 2 variables and fill it in the RSEM input.

I've come up with an approach using infer_experiment.py from RSeQC that outputs file "strandedness.txt" (note the first 2 lines are empty).

This is PairEnd Data Fraction of reads failed to determine: 0.0134 Fraction of reads explained by "1 ,1--,2 -,2- ": 0.0049 Fraction of reads explained by "1 -,1- ,2 ,2--": 0.9817

Now I want to use AWK in order to extract for example the "PairEnd" from file "strandedness.txt" using the following AWK code, which works fine in bash:

awk 'FNR == 3 {print $3}' ./strandedness.txt

Which outputs correctly "PairEnd"

I put the code in this AWK script "pairend.awk":

#!/bin/awk -f
FNR == 3 {print $3}

If I call it separately it works fine:

./pairend.awk ./strandedness.txt

Now I'd like to call the script in my snakemake file, but I can't figure out how this can be done. Most examples on google use AWK scripts in the shell, but I'd actually want to use the script to output a variabled that will be used as input for my snakemake pipeline. I tried this, but can't get it to work:

DATASET,SAMPLE,FRR, =glob_wildcards(config["project_path"] "resources/raw_datasets/{dataset}/{sample}_Homo_sapiens_RNA-Seq_{frr}.fastq.gz")
print(DATASET,SAMPLE,FRR)

rule all:
        input:
                expand(config["project_path"] "results/{dataset}/RSEM/{sample}_strandedness.txt",dataset=DATASET, sample=SAMPLE)

## Transcript quantification
rule RSEM_calculate_expression:
        input:
                fasta=config["project_path"] "resources/Homo_sapiens.GRCh38.dna.primary_assembly.fa",
                gtf=config["project_path"] "resources/Homo_sapiens.GRCh38.106.gtf",
                bam=config["project_path"] "results/{dataset}/star_aligned_2pass/{sample}_Aligned.sortedByCoord.out.bam",
                paired_end=config["project_path"] "workflow/pairend.awk" config["project_path"] "results/{dataset}/RSEM/{sample}_strandedness.txt"
        output:
                gene_results=config["project_path"] "results/{dataset}/RSEM/{sample}.genes.results",
                isoform_results=config["project_path"] "results/{dataset}/RSEM/{sample}.isoforms.results"
        params:
                index=config["project_path"] "resources/starIndex/"
        threads:
                12
        log:
                config["project_path"] "results/{dataset}/RSEM/{sample}.log"
        shell:
                """
                rsem-calculate-expression --star --num-threads {threads} {paired_end} --alignments {input.bam} {input.index} {wildcards.sample} {log}
                """

I've also tried to call the AWK code from an AWK script without success:

Error: SyntaxError in line ... of snakefile: invalid syntax

I dotted out the line, but it points towards the paired_end=config["project_path"] "workflow/pairend.awk" config["project_path"] "results/{dataset}/RSEM/{sample}_strandedness.txt"

How do I correctly run my/an AWK script within the input section of snakemake (I'm not interested in running it in the shell section. Any help would be much appreciated!

CodePudding user response:

Because you are looking for a string instead of a file, this should be placed into params instead of input. You could maybe use the shell function to call awk from an input function, but I've only ever seen it in the run directive.

I think the easiest options are to

  • Convert your awk command to an input function in python.
def get_pairend(wildcards, input):
    # since your file is small, just read the whole thing.  If it's larger don't read everything.
    # get third line, third token
    return open(input.paired_end).readlines()[2].split()[2]

rule RSEM_calculate_expression:
    input:
        ...
        paired_end=config["project_path"] "results/{dataset}/RSEM/{sample}_strandedness.txt"
        ...
    params:
        index=config["project_path"] "resources/starIndex/",
        paired_end=get_pairend,
    shell:
        """
        rsem-calculate-expression --star --num-threads {threads} {params.paired_end} --alignments {input.bam} {input.index} {wildcards.sample} {log}
        """
  • Run awk in your shell directive
rule RSEM_calculate_expression:
    input:
        ...
        paired_end=config["project_path"] "results/{dataset}/RSEM/{sample}_strandedness.txt"

    shell:
        """
        paired=$(awk 'FNR == 3 {{print $3}}' {input.paired_end})
        rsem-calculate-expression --star --num-threads {threads} $paired --alignments {input.bam} {input.index} {wildcards.sample} {log}
        """

I would prefer the first option. It's easier to modify if you decide to do something more complicated and you don't need extra awk knowledge to understand what you are doing.

  • Related