How to use pandas within snakemake pipelines

Time:02-18

I would like to improve the reproducibility of some Python code I wrote by turning it into a data pipeline. I am used to targets in R and would like to find an equivalent in Python. My impression is that snakemake is quite close to that.

I don't understand how to use pandas inside a snakemake rule to read an input, modify it, and then write the output.

Let's take the easiest pipeline I can think of: we take a csv and write a copy somewhere else.

The pipeline works fine when using a shell command:

rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        "test-snakemake.csv"
    run:
        shell("cp {input.path} {output}")

I wanted to use the equivalent approach with pandas (of course here using pandas does not seem necessary but this is to understand the logic):

rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        "test-snakemake.csv"
    run:
        import pandas as pd
        df = pd.read_csv({input.path})
        df.to_csv({output}, header=False)

When I run it, I get this error:

snakemake -c1
Invalid file path or buffer object type: <class 'set'>
  File "/home/jovyan/work/label-openfood/Snakefile", line 19, in __rule_trying_snakemake
  File "/opt/conda/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 482, in _read
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 222, in _open_handles
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 609, in get_handle
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 396, in _get_filepath_or_buffer
  File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Exiting because a job execution failed. Look above for error message

I think the error comes from the read_csv step, but I don't understand what it means (I am used to situations where pandas works like a charm).

Answer:

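You are very close. The curly braces are only needed inside shell("...") strings, where they act as placeholders. The body of a run directive is plain Python, so {input.path} builds a set containing the path, which is why pandas complains about <class 'set'>. Drop the braces and refer to input.path and output.csv directly: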

rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        csv="test-snakemake.csv"
    run:
        import pandas as pd
        df = pd.read_csv(input.path)
        df.to_csv(output.csv, header=False)
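
To see why the braces matter, here is a minimal sketch that runs outside snakemake. The input_files object is a hypothetical stand-in for the input object snakemake provides in a run block, and str.format is used only as an analogy for the placeholder substitution done in shell("...") strings; it is not snakemake's actual implementation.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the `input` object available inside a run block.
input_files = SimpleNamespace(path="untitled.txt")

# In plain Python code, braces around an expression build a set literal.
wrapped = {input_files.path}
print(type(wrapped))  # <class 'set'>

# Without braces, the attribute is simply the path string.
plain = input_files.path
print(type(plain))  # <class 'str'>

# Inside a string, braces are placeholders that get filled in.
# str.format is only an analogy for what happens in shell("...").
command = "cp {input.path} {output}".format(
    input=input_files, output="test-snakemake.csv"
)
print(command)  # cp untitled.txt test-snakemake.csv
```

So pd.read_csv({input.path}) receives a set rather than a path, which matches the "Invalid file path or buffer object type: <class 'set'>" message in the question.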