```groovy
process < name > {
    [ directives ]        // (1)!

    input:                // (2)!
    < process inputs >

    output:               // (3)!
    < process outputs >

    when:                 // (4)!
    < condition >

    [script|shell|exec]:  // (5)!
    """
    < user script to be executed >
    """
}
```
1. Zero, one, or more process directives
2. Zero, one, or more process inputs
3. Zero, one, or more process outputs
4. An optional boolean condition to trigger the process execution
5. The script block: a string statement that defines the command to be executed by the process
A process can execute only one script block. It must be the last statement when the process contains input and output declarations.
The script block can be a single or a multi-line string. The latter simplifies the writing of non-trivial scripts composed of multiple commands spanning over multiple lines. For example:
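A minimal sketch of a multi-line script block (process name and commands are illustrative):

```groovy
process EXAMPLE {
    script:
    """
    echo 'Hello world!' > file.txt
    head -n 1 file.txt > chunk_1.txt
    gzip -c chunk_1.txt > chunk_archive.gz
    """
}

workflow {
    EXAMPLE()
}
```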
By default, the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding shebang declaration. For example:
```groovy
process PYSTUFF {
    script:
    """
    #!/usr/bin/env python

    x = 'Hello'
    y = 'world!'
    print("%s - %s" % (x, y))
    """
}

workflow {
    PYSTUFF()
}
```
Tip
Multiple programming languages can be used within the same workflow script. However, for large chunks of code it is better to save them into separate files and invoke them from the process script. You can store such scripts in the ./bin/ folder.
A process script can contain any string format supported by the Groovy programming language. This allows us to use string interpolation as in the script above or multiline strings. Refer to String interpolation for more information.
Warning
Since Nextflow uses the same Bash syntax for variable substitutions in strings, Bash environment variables need to be escaped using the \ character. The escaped version is resolved later by Bash, returning the task directory (e.g. work/7f/f285b80022d9f61e82cd7f90436aa4/), while an unescaped $PWD would show the directory where you're running Nextflow.
```groovy
process FOO {
    script:
    """
    echo "The current directory is \$PWD"
    """
}

workflow {
    FOO()
}
```
It can be tricky to write a script that uses many Bash variables. One possible alternative is to use a script string delimited by single-quote characters:
```groovy
process BAR {
    script:
    '''
    echo "The current directory is $PWD"
    '''
}

workflow {
    BAR()
}
```
However, this prevents the use of Nextflow variables in the command script.
Another alternative is to use a shell statement instead of script and use a different syntax for Nextflow variables, e.g., !{..}. This allows the use of both Nextflow and Bash variables in the same script.
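A sketch of the shell alternative (the process name and parameter are illustrative): here !{params.name} is resolved by Nextflow, while $X is left to Bash.

```groovy
params.name = 'everyone'

process HELLO {
    shell:
    '''
    X='Bash world'
    echo "Nextflow says hello to !{params.name}, Bash says hello to $X"
    '''
}

workflow {
    HELLO()
}
```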
The process script can also be defined in a completely dynamic manner using an if statement or any other expression for evaluating a string value. For example:
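For instance, a script chosen by an if statement over a hypothetical params.aligner parameter might look like this sketch (tool invocations reduced to echo for illustration):

```groovy
params.aligner = 'salmon'

process ALIGN {
    input:
    path reads

    script:
    if (params.aligner == 'salmon')
        """
        echo salmon quant --reads $reads
        """
    else if (params.aligner == 'kallisto')
        """
        echo kallisto quant $reads
        """
    else
        error "Unknown aligner: ${params.aligner}"
}
```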
Nextflow process instances (tasks) are isolated from each other but can communicate between themselves by sending values through channels.
Inputs implicitly determine the dependencies and the parallel execution of the process. The process execution is fired each time new data is ready to be consumed from the input channel:
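In general, an input block takes the form:

```groovy
input:
  <input qualifier> <input name>
```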
The input block defines the names and qualifiers of variables that refer to channel elements directed at the process. A process can define at most one input block, and it must contain one or more input declarations.
The val qualifier allows you to receive data of any type as input. It can be accessed in the process script by using the specified input name, as shown in the following example:
```groovy
num = Channel.of(1, 2, 3)

process BASICEXAMPLE {
    debug true

    input:
    val x

    script:
    """
    echo process job $x
    """
}

workflow {
    myrun = BASICEXAMPLE(num)
}
```
In the above example the process is executed three times: each time a value is received from the channel num and used to run the script. Thus, it results in output similar to that shown below:
```console
process job 3
process job 1
process job 2
```
Warning
The channel guarantees that items are delivered in the same order as they were sent, but, since the process is executed in parallel, there is no guarantee that they are processed in the same order as they are received.
The path qualifier allows the handling of file values in the process execution context. This means that Nextflow will stage the files in the process execution directory, and they can be accessed in the script by using the name specified in the input declaration.
```groovy
reads = Channel.fromPath('data/ggal/*.fq')

process FOO {
    debug true

    input:
    path sample

    script:
    """
    ls -lh $sample
    """
}

workflow {
    FOO(reads.collect())
}
```
Warning
In the past, the file qualifier was used for files, but the path qualifier should be preferred over file to handle process input files when using Nextflow 19.10.0 or later. When a process declares an input file, the corresponding channel elements must be file objects, created with the file helper function or by a file-specific channel factory (e.g., Channel.fromPath or Channel.fromFilePairs).
Exercise
Write a script that creates a channel containing all read files matching the pattern data/ggal/*_1.fq followed by a process that concatenates them into a single file and prints the first 10 lines.
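One possible solution sketch (process name and staged file names are illustrative):

```groovy
reads_ch = Channel.fromPath('data/ggal/*_1.fq')

process CONCATENATE {
    debug true

    input:
    path '*'

    script:
    """
    cat * > merged.fq
    head -n 10 merged.fq
    """
}

workflow {
    CONCATENATE(reads_ch.collect())
}
```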
A key feature of processes is the ability to handle inputs from multiple channels. However, it’s important to understand how channel contents and their semantics affect the execution of a process.
```groovy
ch1 = Channel.of(1, 2, 3)
ch2 = Channel.of('a', 'b', 'c')

process FOO {
    debug true

    input:
    val x
    val y

    script:
    """
    echo $x and $y
    """
}

workflow {
    FOO(ch1, ch2)
}
```
Both channels emit three values, therefore the process is executed three times, each time with a different pair:
(1, a)
(2, b)
(3, c)
What is happening is that the process waits until there’s a complete input configuration, i.e., it receives an input value from all the channels declared as input.
When this condition is satisfied, it consumes the input values from the respective channels, spawns a task execution, and then repeats the same logic until one or more channels have no more content.
This means channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even if there are other values in other channels.
So what happens when channels do not have the same cardinality (i.e., they emit a different number of elements)?
```groovy
input1 = Channel.value(1)
input2 = Channel.of('a', 'b', 'c')

process BAR {
    debug true

    input:
    val x
    val y

    script:
    """
    echo $x and $y
    """
}

workflow {
    BAR(input1, input2)
}
```
Script output:

```console
1 and b
1 and a
1 and c
```
This is because value channels can be consumed multiple times and do not affect process termination.
Exercise
Write a process that is executed for each read file matching the pattern data/ggal/*_1.fq and use the same data/ggal/transcriptome.fa in each execution.
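A related pattern is the each input repeater, which re-executes the process for every item of a collection. The description below refers to an example along the lines of this sketch (names, modes, and the alignment command are illustrative):

```groovy
sequences = Channel.fromPath('data/ggal/*_1.fq')
methods = ['regular', 'expresso', 'psicoffee']

process ALIGNSEQUENCES {
    debug true

    input:
    path seq
    each mode

    script:
    """
    echo t_coffee -in $seq -mode $mode
    """
}

workflow {
    ALIGNSEQUENCES(sequences, methods)
}
```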
In the above example, every time a file of sequences is received as an input by the process, it executes three tasks, each running a different alignment method set as a mode variable. This is useful when you need to repeat the same task for a given set of parameters.
Exercise
Extend the previous example so a task is executed for each read file matching the pattern data/ggal/*_1.fq and repeat the same task with both salmon and kallisto.
The val qualifier specifies a defined value in the script context. Values are frequently defined in the input and/or output declaration blocks, as shown in the following example:
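A minimal sketch of a value re-emitted as output (names are illustrative):

```groovy
greeting_ch = Channel.of('Hello world!')

process FOO {
    input:
    val x

    output:
    val x

    script:
    """
    echo $x > file.txt
    """
}

workflow {
    FOO(greeting_ch).view()
}
```

Similarly, the path qualifier in an output block declares a file produced by the task. A sketch consistent with the RANDOMNUM description below (reconstructed; \$RANDOM is a Bash variable, hence the escape):

```groovy
process RANDOMNUM {
    output:
    path 'result.txt'

    script:
    """
    echo \$RANDOM > result.txt
    """
}

workflow {
    receiver_ch = RANDOMNUM()
    receiver_ch.view()
}
```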
In the above example the process RANDOMNUM creates a file named result.txt containing a random number.
Since a file parameter using the same name is declared in the output block, the file is sent over the receiver_ch channel when the task is complete. A downstream process declaring the same channel as input will be able to receive it.
When an output file name contains a wildcard character (* or ?) it is interpreted as a glob path matcher. This allows us to capture multiple files into a list object and output them as a sole emission. For example:
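A sketch of a glob output (process name and command are illustrative): split produces several chunk_* files, which are collected into a single list emission.

```groovy
process SPLITLETTERS {
    output:
    path 'chunk_*'

    script:
    """
    printf 'Hola' | split -b 1 - chunk_
    """
}

workflow {
    letters = SPLITLETTERS()
    letters.view()
}
```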
When an output file name needs to be expressed dynamically, it is possible to define it using a dynamic string that references values defined in the input declaration block or in the script global context. For example:
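A sketch of a dynamic output name built from an input value (names and command are illustrative):

```groovy
species_ch = Channel.of('cat', 'dog', 'sloth')

process ALIGN {
    input:
    val species

    output:
    path "${species}.aln"

    script:
    """
    echo aligning $species > ${species}.aln
    """
}

workflow {
    ALIGN(species_ch)
}
```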
So far we have seen how to declare multiple input and output channels that can handle one value at a time. However, Nextflow can also handle a tuple of values.
The input and output declarations for tuples must be declared with a tuple qualifier followed by the definition of each element in the tuple.
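A sketch of a tuple input and output (names are illustrative): each element pairs a sample ID with its read files, and the output keeps the ID attached to the produced file.

```groovy
reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')

process FOO {
    input:
    tuple val(sample_id), path(sample_files)

    output:
    tuple val(sample_id), path('sample.txt')

    script:
    """
    cat $sample_files > sample.txt
    """
}

workflow {
    FOO(reads_ch).view()
}
```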
The when declaration allows you to define a condition that must be verified in order to execute the process. This can be any expression that evaluates to a boolean value.
It is useful to enable/disable the process execution depending on the state of various inputs and parameters. For example:
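A sketch of a when condition gating execution on both an input file name and a parameter (names, the pattern, and the command are illustrative):

```groovy
params.dbtype = 'nr'
proteins_ch = Channel.fromPath('data/prots/*.tfa')

process FIND {
    debug true

    input:
    path fasta
    val type

    when:
    fasta.name =~ /^BB11.*/ && type == 'nr'

    script:
    """
    echo blastp -query $fasta -db nr
    """
}

workflow {
    FIND(proteins_ch, params.dbtype)
}
```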
Directive declarations allow the definition of optional settings that affect the execution of the current process without affecting the semantics of the task itself.
They must be entered at the top of the process body, before any other declaration blocks (i.e., input, output, etc.).
Directives are commonly used to define the amount of computing resources to be used or other meta directives that allow the definition of extra configuration of logging information. For example:
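A sketch of common resource directives (values and container name are illustrative):

```groovy
process FOO {
    cpus 2
    memory 1.GB
    container 'image/name'

    script:
    """
    echo your_command --here
    """
}
```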
Given that each task is executed in a separate temporary folder under work/ (e.g., work/f1/850698…, work/g3/239712…), we may want to save important, non-intermediate, and/or final files in a results folder.
Tip
Remember to delete the work folder from time to time to clear your intermediate files and stop them from filling your computer!
To store our workflow result files, we need to explicitly mark them using the directive publishDir in the process that’s creating the files. For example:
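A sketch consistent with the BLASTSEQ description below (input data and the command, reduced to echo, are illustrative):

```groovy
params.outdir = 'my-results'

process BLASTSEQ {
    publishDir params.outdir, mode: 'copy'

    input:
    path fasta

    output:
    path "${fasta}.blast_result"

    script:
    """
    echo blastp -query $fasta > ${fasta}.blast_result
    """
}

workflow {
    BLASTSEQ(Channel.fromPath('data/prots/*.tfa'))
}
```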
The above example will copy all blast script files created by the BLASTSEQ process into the directory path my-results.
Tip
The publish directory can be local or remote. For example, output files could be stored using an AWS S3 bucket by using the s3:// prefix in the target path.
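Multiple publishDir directives can be combined with pattern options to route different outputs to different sub-directories. A sketch consistent with the description below (names and commands are illustrative):

```groovy
params.outdir = 'my-results'

process FOO {
    publishDir "$params.outdir/$sample_id", mode: 'copy'
    publishDir "$params.outdir/$sample_id/counts", pattern: '*_counts.txt', mode: 'copy'
    publishDir "$params.outdir/$sample_id/outlooks", pattern: '*_outlook.txt', mode: 'copy'

    input:
    tuple val(sample_id), path(sample_files)

    output:
    path '*'

    script:
    """
    echo count > ${sample_id}_counts.txt
    echo outlook > ${sample_id}_outlook.txt
    """
}

workflow {
    FOO(Channel.fromFilePairs('data/ggal/*_{1,2}.fq'))
}
```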
The above example will create an output structure in the directory my-results, that contains a separate sub-directory for each given sample ID, each containing the folders counts and outlooks.