I would like to improve the reproducibility of some Python code I wrote by turning it into a data pipeline. I am used to targets in R and would like to find an equivalent in Python. I have the impression that snakemake is quite close to that.
I don't understand how to use pandas to read an input in a snakemake task, modify it and then write the output.
Let's take the simplest pipeline I can think of: we take a CSV file and write a copy somewhere else.
The pipeline works fine when using a shell command:
rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        "test-snakemake.csv"
    run:
        shell("cp {input.path} {output}")
I wanted to use the equivalent approach with pandas (of course here using pandas does not seem necessary but this is to understand the logic):
rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        "test-snakemake.csv"
    run:
        import pandas as pd
        df = pd.read_csv({input.path})
        df.to_csv({output}, header=False)
snakemake -c1
Invalid file path or buffer object type: <class 'set'>
File "/home/jovyan/work/label-openfood/Snakefile", line 19, in __rule_trying_snakemake
File "/opt/conda/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 482, in _read
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py", line 222, in _open_handles
File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 609, in get_handle
File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 396, in _get_filepath_or_buffer
File "/opt/conda/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Exiting because a job execution failed. Look above for error message
I think the error comes from the read_csv step, but I don't understand what it means (I am used to situations where pandas works like a charm).
CodePudding user response:
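You are very close: the curly braces are not needed within the run directive. Inside run, input and output are ordinary Python objects, so {input.path} is a Python set literal containing the path rather than the path itself, which is why pandas complains about <class 'set'>. The braces are only placeholders inside the string passed to shell(). Access the values directly instead: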
rule trying_snakemake:
    input:
        path="untitled.txt"
    output:
        csv="test-snakemake.csv"
    run:
        import pandas as pd
        df = pd.read_csv(input.path)
        df.to_csv(output.csv, header=False)
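To see why the braces behave differently in the two cases, here is a minimal sketch in plain Python, outside of snakemake. The names fake_input and fake_output are hypothetical stand-ins for the objects snakemake provides inside run, and str.format is used only as a rough approximation of how shell() fills in its placeholders:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for snakemake's `input` and `output` objects.
fake_input = SimpleNamespace(path="untitled.txt")
fake_output = "test-snakemake.csv"

# Inside a string, braces are placeholders that get replaced by values.
command = "cp {input.path} {output}".format(input=fake_input, output=fake_output)

# In ordinary Python code, braces around a value build a set containing it.
wrapped = {fake_input.path}

print(command)
print(type(wrapped))          # a set, which pandas cannot open as a file
print(type(fake_input.path))  # a str, which pandas accepts as a path
```

Passing wrapped to pd.read_csv is what produces the "Invalid file path or buffer object type" message, while fake_input.path is a plain string.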