8 snakemake
usage
This chapter describes how we can use snakemake
to make scheduling comparison jobs easier to handle than writing/maintaining our own scheduler(s).
8.1 Advantages of snakemake
8.1.1 Snakemake is well-maintained, and widely-used
8.1.2 Snakemake distributes jobs
A major hassle with the older versions of pyani
was keeping on top of cluster scheduler changes. snakemake
has plugins for cluster and cloud operation, so it should be possible for us to ride on top of that and maintain a single interface to snakemake
to handle our job dependencies and distribution.
8.2 Using the snakemake
CLI
With the CLI we call snakemake
and pass it the path to the snakefile (which defines and configures the workflow), the number of cores, and any other arguments. The snakefile An example is given below.
# Example snakemake file performing pairwise comparisons
from itertools import permutations
# Pick up genome filestems
= glob_wildcards("data/{genome}.fna")
(GENOMES,) = list(permutations(GENOMES, 2)) # all pairwise comparisons fwd and reverse
CMPS
# Rule `all` defines all A vs B comparisons, the `nucmer` rule runs a
# single pairwise comparison at a time
# The `zip` argument to `expand()` prevents this function generating the
# product of every member of each list. Instead we have extracted each
# participant in all pairwise comparisons into separate lists
all:
rule input:
expand("results/{genomeA}_vs_{genomeB}.delta",
zip,
=[_[0] for _ in CMPS],
genomeA=[_[1] for _ in CMPS],
genomeB# outdir=OUTDIR,
),
# The nucmer rule runs nucmer in the forward direction only
rule nucmer:
output:"{outdir}/{genomeA}_vs_{genomeB}.delta",
run:
shell("nucmer data/{wildcards.genomeA}.fna data/{wildcards.genomeB}.fna "
"-p {wildcards.outdir}/{wildcards.genomeA}_vs_{wildcards.genomeB} "
"--maxmatch"
)
This example will take all the .fna
files in the data
subdirectory and generate all the pairwise combinations (forward and reverse) for comparison. Running the snakefile with snakemake --snakefile example.sml --cores all
runs these pairwise comparisons and puts the output in the results
subdirectory.
8.3 Programmatic workflows with snakemake
As snakemake
is written in Python and exposes its internals with an API, we can control workflows programmatically. In the example.py
file, we use the API to specify target output files for the example.smk
workflow, giving us control over which comparisons are run.
This ability will be important for us, as we will be using a database backend to avoid the need to rerun comparisons for which we already have data. So our actual workflow will be something like:
- parse input data directory
- filter the input data against the databse to identify only comparisons that have not been run before
- pass the expected output files to
snakemake
so that snakemake handles the job distribution
# Example file calling snakemake scheduler from Python
from pathlib import Path
from snakemake.api import SnakemakeApi, _get_executor_plugin_registry
from snakemake.settings import ConfigSettings, DAGSettings, ResourceSettings
# In a real situation, we can choose a snakefile to suit the analysis
= Path("example.smk")
snakefile
# Define arguments to pass to the snakefile
# config_args = {"outdir": "script_results"}
# Define a subset of target files to generate
= ["script_results/genome_2_vs_genome_3.delta",
target_files "script_results/genome_4_vs_genome_3.delta"]
# Use the defined workflow from the Python API
with SnakemakeApi() as snakemake_api:
= snakemake_api.workflow(snakefile=snakefile,
workflow_api =ResourceSettings(cores=8),
resource_settings=ConfigSettings(
config_settings=config_args,
config
)
)= workflow_api.dag(
dag_api = DAGSettings(
dag_settings =target_files,
targets
)
) dag_api.execute_workflow()
The snakemake
API is not documented in a detailed way. The best resource is to start at the actual CLI code for snakemake
, linked below.