Topics to be covered
- Anaconda
- Snakemake
- Docker
Let’s try to make a small bioinformatics pipeline
Let’s say we have some FASTA files containing protein sequences, and we’d like to make a phylogeny with those sequences.
To do this, we can write the following bash script, called butterbean.sh
It would look like the following
#!/bin/bash
# Usage: bash butterbean.sh <input.fasta>
mafft --auto --phylipout "$1" > "$1".aln   # multiple sequence alignment
fasttree "$1".aln > "$1".tree              # approximate maximum-likelihood tree
Data for the pipeline!
Create a file called ex1.fasta from here
The pipeline works the following way:
First, mafft takes as input a FASTA file of protein sequences, passed as the first argument $1 (in this case ex1.fasta). The --auto flag tells the program to automatically choose the best algorithm for generating the alignment, and it then produces a multiple sequence alignment named ex1.fasta.aln (from $1.aln).
Then, fasttree takes as input the output from mafft, referenced as $1.aln (in our case ex1.fasta.aln). It then produces a Newick-format phylogeny of the sequences in the file referenced as $1.tree (in our case ex1.fasta.tree).
Running our pipeline
Issue the following command to run the pipeline butterbean.sh
on our data ex1.fasta
bash butterbean.sh ex1.fasta
What did the output look like?
You may have received the following error message
butterbean.sh: line 2: mafft: command not found
butterbean.sh: line 3: fasttree: command not found
What went wrong? 😠😠
Well, mafft
and fasttree
are dependencies of butterbean
and must be installed on your system for the pipeline to work.
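One quick way to diagnose this kind of failure (a generic shell technique, not specific to butterbean) is to check whether each dependency is on your PATH before running the pipeline:

```shell
# command -v prints the program's path and succeeds if it is installed,
# and fails silently otherwise
for tool in mafft fasttree; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING"
    fi
done
```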
If only there were a system in place for managing software packages and dependencies….
Anaconda (conda): a software package, dependency, and environment management system.
It makes pinning exact tool versions easy!
Make a conda environment for butterbean’s dependencies
Make a file called butterbean.requirements.txt
with the following information in it
mafft=7.310
fasttree=2.1.9
This file can be used to create a conda environment called {your_name_here}_butterbean using the following command
DO NOT USE UPPER CASE IN {your_name_here}
conda create --name {your_name_here}_butterbean -c bioconda --file butterbean.requirements.txt
Issue the following command to activate this environment (on newer conda versions, use conda activate instead of source activate)
source activate {your_name_here}_butterbean
Now we can rerun the pipeline
bash butterbean.sh ex1.fasta
Hopefully it actually worked!
You should have a tree!
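The .tree file is in Newick format: nested parentheses with branch lengths. Here is a toy sketch of how to pull the taxon labels out of such a string (the tree below is made up, not the real ex1 output):

```python
import re

# A made-up Newick tree with three taxa and branch lengths
newick = "((A:0.1,B:0.2):0.05,C:0.3);"

# Leaf labels are the names immediately preceding a ':' branch length;
# the unlabeled internal node after ')' is skipped by the character class
taxa = re.findall(r"([A-Za-z0-9_]+):", newick)
print(taxa)  # ['A', 'B', 'C']
```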
What if we wanted to run our pipeline on hundreds of fasta files?
Can we add multithread support?
Can we have the dependencies automatically install themselves?
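Without a workflow manager, the obvious approach is a shell loop, which processes every file serially and restarts from scratch if anything fails partway (a sketch; assumes butterbean.sh is in the current directory):

```shell
shopt -s nullglob             # skip the loop cleanly when no .fasta files match
for f in *.fasta; do
    bash butterbean.sh "$f"   # one file at a time; no parallelism, no resume
done
```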
Snakemake
This is a workflow management system that can increase the usability and reproducibility of our pipeline.
Setup conda environments
First, we need to take care of dependencies. We can define conda environments that Snakemake will set up and use on the fly during the analysis.
make a directory called envs
In that directory, make the file envs/mafft.yaml
with the following contents
channels:
- bioconda
dependencies:
- mafft=7.310
Make another file envs/fasttree.yaml
with the following contents
channels:
- bioconda
dependencies:
- fasttree=2.1.9
Make Snakefile
Make a file called Snakefile
with the following information in it
SAMPLES, = glob_wildcards("{sample}.fasta")
This uses wildcards to find all files ending in .fasta that will be used by the pipeline; the trailing comma after SAMPLES unpacks the one-element result of glob_wildcards
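A sketch of what glob_wildcards does under the hood, using plain Python (with ex1.fasta and ex2.fasta present, SAMPLES would be ['ex1', 'ex2']):

```python
import os
import tempfile

# Create a scratch directory with two dummy FASTA files
workdir = tempfile.mkdtemp()
for name in ("ex1.fasta", "ex2.fasta"):
    open(os.path.join(workdir, name), "w").close()

# glob_wildcards("{sample}.fasta") effectively strips the ".fasta" suffix
# from every matching file, leaving just the sample names
samples = sorted(
    f[: -len(".fasta")] for f in os.listdir(workdir) if f.endswith(".fasta")
)
print(samples)  # ['ex1', 'ex2']
```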
Add the final rule (the target rule that asks for every tree) to Snakefile
rule final:
    input: expand("{sample}.tree", sample=SAMPLES)
Then, add the first rule
rule mafft:
    input:
        "{sample}.fasta"
    output:
        "{sample}.aln"
    conda:
        "envs/mafft.yaml"
    shell:
        "mafft --auto --phylipout {input} > {output}"
Then, we add the second rule
rule fasttree:
    input:
        "{sample}.aln"
    output:
        "{sample}.tree"
    conda:
        "envs/fasttree.yaml"
    shell:
        "fasttree {input} > {output}"
Altogether, the file looks like the following
SAMPLES, = glob_wildcards("{sample}.fasta")

rule final:
    input: expand("{sample}.tree", sample=SAMPLES)

rule mafft:
    input:
        "{sample}.fasta"
    output:
        "{sample}.aln"
    conda:
        "envs/mafft.yaml"
    shell:
        "mafft --auto --phylipout {input} > {output}"

rule fasttree:
    input:
        "{sample}.aln"
    output:
        "{sample}.tree"
    conda:
        "envs/fasttree.yaml"
    shell:
        "fasttree {input} > {output}"
Run the pipeline with the following command (newer Snakemake versions also require a --cores N argument, which additionally lets independent jobs run in parallel)
snakemake --use-conda
What if we want others to rerun our analysis on their computer with an unsupported operating system?
Can we distribute our pipeline as a self-contained virtual environment?
Docker
Create a file Dockerfile
with the following information
FROM ubuntu
MAINTAINER Jeff Cole <coleti16@students.ecu.edu>
# Install build tools and Python, then Snakemake
RUN apt-get -qq update
RUN apt-get install -y wget git build-essential cmake unzip curl
RUN apt-get install -qqy python3-setuptools python3-docutils python3-flask
RUN easy_install3 snakemake
# Install Miniconda into /opt/conda and put it on the PATH
RUN echo 'export PATH=/opt/conda/bin:$PATH' > /etc/profile.d/conda.sh && \
    wget --quiet https://repo.continuum.io/miniconda/Miniconda3-4.3.14-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh
ENV PATH /opt/conda/bin:$PATH
# Clone the tutorial repository and run the pipeline at build time
WORKDIR /home/user/
RUN git clone https://github.com/tijeco/ReproducibilityTutorial.git
WORKDIR /home/user/ReproducibilityTutorial
# Point /bin/sh at bash so conda environment activation works
RUN ln -sf /bin/bash /bin/sh
RUN snakemake --use-conda
Build the Docker image (image tags must be lowercase, hence the earlier warning about {your_name_here})
docker build -t {your_name_here}_butterbean ./
Run the Docker container with an interactive shell using the following command
docker run -it {your_name_here}_butterbean /bin/bash
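The analysis results (e.g. ex1.fasta.tree) live inside the container at /home/user/ReproducibilityTutorial. One way to copy them to the host is docker cp; <container_id> below is a placeholder you would read off the output of docker ps:

```
docker ps -a    # note the CONTAINER ID of the {your_name_here}_butterbean container
docker cp <container_id>:/home/user/ReproducibilityTutorial/ex1.fasta.tree .
```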