Using Nextflow to create Bioinformatics Pipelines (Part 1: Why use Nextflow?)

nextflow.io

Howdy folks! I am writing this to document and share my journey as I learn Nextflow and integrate it into my bioinformatics workflows.

What is Nextflow?

Nextflow is a Domain-Specific Language (DSL) developed by the friendly folks at Seqera Labs. It was created specifically to handle common bioinformatics datasets and data structures, and it has many tricks up its sleeve that allow for fast prototyping and deployment on any platform.

Who is it for?

If you work in a scenario where you need to process hundreds of DNA/RNA sequenced samples every week, then Nextflow is for you!

Many bioinformatics tasks that you run, for example running fastqc, are sequential by default and will eat up a lot of compute time as you wait to run the next step in your workflow. For example, to run fastqc on all of your fastq files, you might use a for loop like the one below.

for i in *.fastq.gz ; do fastqc "$i" ; done

Note that this loop is sequential, meaning it processes only one file at a time. If fastqc takes 10 minutes per fastq file, the total adds up very quickly when you process hundreds of samples.
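
To put a rough number on that, here is a small back-of-the-envelope sketch in shell. The 10 minutes per file comes from the example above; the sample count of 200 and the function name estimate_minutes are just assumptions for illustration.

```shell
#!/usr/bin/env bash
# Rough wall-clock estimate for a loop that handles one file at a time.
# estimate_minutes is a hypothetical helper written for this sketch.
estimate_minutes() {
    local samples=$1 minutes_per_file=$2
    echo $(( samples * minutes_per_file ))
}

samples=200            # hypothetical sample count
minutes_per_file=10    # per-file figure used in the text above
total=$(estimate_minutes "$samples" "$minutes_per_file")
echo "Sequential run: ${total} minutes (about $(( total / 60 )) hours)"
```

With these numbers the loop would need 2000 minutes, or roughly 33 hours, before the next step of the workflow could even start.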

Now of course there are workarounds, e.g. GNU Parallel, that you can use for certain tasks to decrease your turnaround time. However, it can be daunting for beginners, and taking full advantage of it requires a deeper understanding of the underlying architecture of the machine.

parallel -j 10 "fastqc {} --outdir ." ::: *.fastq.gz
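
The "-j 10" above is a guess about the machine. As a minimal sketch of what "understanding the architecture" can mean in practice, the snippet below derives the job count from the number of CPU cores instead. The "cores minus one" rule and the helper name pick_jobs are my own assumptions, not something GNU Parallel prescribes.

```shell
#!/usr/bin/env bash
# Derive a job count from the number of CPU cores, keeping one core free.
# The "cores minus one" rule is an illustrative assumption, not a GNU Parallel default.
pick_jobs() {
    local cores=$1
    if [ "$cores" -gt 1 ]; then
        echo $(( cores - 1 ))
    else
        echo 1
    fi
}

# nproc is part of GNU coreutils; getconf is used as a fallback elsewhere.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
njobs=$(pick_jobs "$cores")
echo "Detected ${cores} cores; would run ${njobs} fastqc jobs at once"

# Only launch the real work when both tools are installed and input files exist.
if command -v parallel >/dev/null && command -v fastqc >/dev/null \
   && ls *.fastq.gz >/dev/null 2>&1; then
    parallel -j "$njobs" "fastqc {} --outdir ." ::: *.fastq.gz
fi
```

Even this small example has to care about which tools are installed and how many cores the machine has, which is exactly the kind of bookkeeping that gets tedious as a workflow grows.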

In Nextflow, independent tasks run in parallel by default, which lets the user scale up or scale out without having to adapt the pipeline to a platform-specific architecture. This, along with many other features as detailed here, is the reason for Nextflow’s extreme popularity and adoption in recent years.
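
To give a feel for what that looks like, here is a hedged sketch: a shell snippet that writes a minimal Nextflow (DSL2) script and runs it only if the tools are installed. The file name fastqc.nf, the process name FASTQC and the params.reads parameter are names I picked for illustration, and the output pattern assumes fastqc's usual "_fastqc.zip" and "_fastqc.html" file names.

```shell
#!/usr/bin/env bash
# Write a minimal Nextflow (DSL2) script that runs fastqc once per input file.
# fastqc.nf, FASTQC and params.reads are hypothetical names for this sketch.
# The quoted 'EOF' keeps the shell from expanding $reads inside the heredoc.
cat > fastqc.nf <<'EOF'
params.reads = "*.fastq.gz"

process FASTQC {
    input:
    path reads

    output:
    path "*_fastqc.{zip,html}"

    script:
    """
    fastqc $reads
    """
}

workflow {
    // Each file emitted by the channel becomes its own FASTQC task,
    // and independent tasks can be scheduled in parallel.
    reads_ch = Channel.fromPath(params.reads)
    FASTQC(reads_ch)
}
EOF

# Run the pipeline only if both Nextflow and fastqc are available.
if command -v nextflow >/dev/null && command -v fastqc >/dev/null; then
    nextflow run fastqc.nf
fi
```

Notice that there is no loop and no job count anywhere in the script: the process is described once, and the number of tasks follows from the number of files in the channel.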

On a side note, many bioinformatics job postings these days seek candidates who are familiar with Nextflow, and if you like to participate in hackathons, learning Nextflow will definitely help you strengthen your resume.

Why use Nextflow?

Many bioinformatics pipelines consist of various tools that are chained together in a dataflow programming manner to get the final results. Some of these tools are run from the Bash shell, some are written in Python and require a Python interpreter, and some are written in R and require an R console for data analysis. Oftentimes, bioinformaticians and data analysts also have custom scripts that they run on their samples to get the desired results.

Doing this requires jumping from one programming interpreter to another, which is cumbersome to manage from a single script file. Nextflow makes this easier: each process can define its own code block in the language it needs, which makes it very efficient to chain all of your analysis steps into one main file. That again is a huge benefit.
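
As a rough illustration of the idea, the shell snippet below writes a small Nextflow (DSL2) file in which each process uses a different interpreter, selected by a shebang line at the top of its script block. The file name mixed.nf and the process names are hypothetical, and the steps only print a message, so treat it as a sketch rather than a real analysis.

```shell
#!/usr/bin/env bash
# Write a Nextflow (DSL2) file whose processes each use a different interpreter.
# mixed.nf and the process names are hypothetical names for this sketch.
cat > mixed.nf <<'EOF'
process BASH_STEP {
    output:
    stdout

    script:
    """
    echo "hello from Bash"
    """
}

process PYTHON_STEP {
    output:
    stdout

    script:
    """
    #!/usr/bin/env python3
    print("hello from Python")
    """
}

process R_STEP {
    output:
    stdout

    script:
    """
    #!/usr/bin/env Rscript
    print("hello from R")
    """
}

workflow {
    BASH_STEP()
    PYTHON_STEP()
    R_STEP()
    BASH_STEP.out.view()
    PYTHON_STEP.out.view()
    R_STEP.out.view()
}
EOF

# Run it only if Nextflow and both interpreters are installed.
if command -v nextflow >/dev/null && command -v python3 >/dev/null \
   && command -v Rscript >/dev/null; then
    nextflow run mixed.nf
fi
```

All three languages live in one file, and Nextflow takes care of launching the right interpreter for each step.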

Pre-requisites

Learning Nextflow can be extremely difficult or extremely easy depending on your experience with various programming concepts. If you are used to writing some code and understand basic data types and data structures (e.g. lists, tuples, dictionaries, strings), it will definitely be a huge plus.

Nextflow is built on top of the Groovy programming language, and Groovy in turn runs on the Java platform. Familiarity with either of those languages will be beneficial but is not required.

Part 2

Stay tuned for Part 2, in which we will go over the key concepts of Nextflow.

Faraz Ahmed

Bioinformatics Programmer III @Cornell