5. Channels
Channels are a key data structure of Nextflow that allows the implementation of reactive-functional oriented computational workflows based on the Dataflow programming paradigm.
They are used to logically connect tasks to each other or to implement functional style data transformations.
5.1 Channel types
Nextflow distinguishes two different kinds of channels: queue channels and value channels.
5.1.1 Queue channel
A queue channel is an asynchronous unidirectional FIFO queue that connects two processes or operators.
- asynchronous means that operations are non-blocking.
- unidirectional means that data flows from a producer to a consumer.
- FIFO means that the data is guaranteed to be delivered in the same order as it is produced. First In, First Out.
A queue channel is implicitly created by process output definitions or using channel factories such as Channel.of or Channel.fromPath.
Try the following snippets:
- Use the built-in print line function println to print the ch channel.
- Apply the view channel operator to the ch channel to print each item emitted by the channel.
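A minimal sketch of the two steps above, assuming a channel named ch that emits three illustrative values (the values are not taken from the original snippet):

```groovy
// Create a queue channel emitting three illustrative values
ch = Channel.of(1, 2, 3)

// println prints the channel object itself, not the items it carries
println(ch)

// view prints each item emitted by the channel, one per line
ch.view()
```

The first call shows a description of the channel object, while the second shows the emitted items. That difference is why view is the usual way to inspect a channel.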
Exercise
Try to execute this snippet. You can do that by creating a new .nf file or by editing an already existing .nf file.
5.1.2 Value channels
A value channel (a.k.a. singleton channel) by definition is bound to a single value and it can be read unlimited times without consuming its contents. A value
channel is created using the value channel factory or by operators returning a single value, such as first, last, collect, count, min, max, reduce, and sum.
To better understand the difference between value and queue channels, save the snippet below as example.nf.
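A minimal sketch of such a script, consistent with the explanation that follows. The process name SUM and the emitted values are assumptions made for illustration:

```groovy
// example.nf (sketch): a process that adds its two inputs
process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    """
    echo \$(($x+$y))
    """
}

workflow {
    // Two queue channels: ch1 emits three items, ch2 emits a single item
    ch1 = Channel.of(1, 2, 3)
    ch2 = Channel.of(1)

    // A task starts only when every input channel has an item available
    SUM(ch1, ch2).view()
}
```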
When you run the script, it prints only 2, as you can see below:
A process will only instantiate a task when there are elements to be consumed from all the channels provided as input to it. Because ch1 and ch2 are queue channels, and the single element of ch2 has been consumed, no new process instances will be launched, even if there are other elements to be consumed in ch1.
To use the single element in ch2 multiple times, we can either use Channel.value as mentioned above, or use a channel operator that returns a single element such as first() below:
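A sketch of that change, reusing the same assumed SUM process and illustrative values. The only difference is that ch2 is turned into a value channel with first():

```groovy
process SUM {
    input:
    val x
    val y

    output:
    stdout

    script:
    """
    echo \$(($x+$y))
    """
}

workflow {
    ch1 = Channel.of(1, 2, 3)
    ch2 = Channel.of(1)

    // first() returns a value channel, so its single item can be read repeatedly
    SUM(ch1, ch2.first()).view()
}
```

With these values, one task runs per item in ch1, so three sums are printed. Their order may vary between runs because tasks execute in parallel.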
In addition, in many situations Nextflow implicitly converts variables to value channels when they are used in a process invocation. For example, when you invoke a process with a workflow parameter (params.example) that has a string value, it is automatically cast into a value channel.
5.2 Channel factories
Channel factories are Nextflow commands for creating channels. Each factory has its own expected inputs and behavior.
5.2.1 value()
The value channel factory is used to create a value channel. An optional non-null argument can be specified to bind the channel to a specific value. For example:
- Creates an empty value channel
- Creates a value channel and binds a string to it
- Creates a value channel and binds a list object to it that will be emitted as a sole emission
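The three cases listed above can be sketched as follows. The channel names and the bound values are illustrative assumptions:

```groovy
// 1. Empty value channel
ch1 = Channel.value()

// 2. Value channel bound to a string
ch2 = Channel.value('Hello there')

// 3. Value channel bound to a list, which is emitted as a single item
ch3 = Channel.value([1, 2, 3, 4, 5])
```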
5.2.2 of()
The Channel.of channel factory allows the creation of a queue channel with the values specified as arguments.
The first line in this example creates a variable ch which holds a channel object. This channel emits the values specified as a parameter in the of channel factory. Thus the second line will print the following:
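A minimal sketch of such a two-line example, assuming the values 1, 3, 5, and 7 and a "value:" prefix in the output (both chosen for illustration):

```groovy
// Line 1: create a queue channel emitting four values
ch = Channel.of(1, 3, 5, 7)

// Line 2: print each value with a prefix
ch.view { v -> "value: $v" }

// Prints:
// value: 1
// value: 3
// value: 5
// value: 7
```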
The Channel.of channel factory works in a similar manner to Channel.from (which is now deprecated), fixing some inconsistent behaviors of the latter and providing better handling when specifying a range of values. For example, the following works with a range from 1 to 23:
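A sketch of the range handling. The extra 'X' and 'Y' values are an assumption added to show that ranges and single values can be mixed:

```groovy
// The range 1..23 is expanded into 23 separate items, followed by 'X' and 'Y'
Channel
    .of(1..23, 'X', 'Y')
    .view()
```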
5.2.3 fromList()
The Channel.fromList channel factory creates a channel emitting the elements provided by a list object specified as an argument:
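For instance, assuming a small list of strings (the list content is illustrative):

```groovy
list = ['hello', 'world']

// Each list element becomes a separate item in the channel
Channel
    .fromList(list)
    .view()

// Prints:
// hello
// world
```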
5.2.4 fromPath()
The fromPath channel factory creates a queue channel emitting one or more files matching the specified glob pattern.
This example creates a channel and emits as many items as there are files with a csv extension in the ./data/meta folder. Each element is a file object implementing the Path interface.
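A sketch matching that description. The glob pattern is inferred from the folder and extension named above, and the second call uses the checkIfExists option from the options table:

```groovy
// One item per CSV file found in ./data/meta
Channel
    .fromPath('./data/meta/*.csv')
    .view()

// Same pattern, but fail early if nothing matches
Channel
    .fromPath('./data/meta/*.csv', checkIfExists: true)
    .view { f -> f.name }
```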
Tip
Two asterisks (**) work like * but cross directory boundaries. This syntax is generally used for matching complete paths. Curly brackets specify a collection of sub-patterns.
Name | Description |
---|---|
glob | When true interprets characters * , ? , [] and {} as glob wildcards, otherwise handles them as normal characters (default: true ) |
type | Type of path returned, either file , dir or any (default: file ) |
hidden | When true includes hidden files in the resulting paths (default: false ) |
maxDepth | Maximum number of directory levels to visit (default: no limit ) |
followLinks | When true symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true ) |
relative | When true return paths are relative to the top-most common directory (default: false ) |
checkIfExists | When true throws an exception when the specified path does not exist in the file system (default: false ) |
Learn more about the glob patterns syntax at this link.
Exercise
Use the Channel.fromPath channel factory to create a channel emitting all files with the suffix .fq in the data/ggal/ directory and any subdirectory, in addition to hidden files. Then print the file names.
5.2.5 fromFilePairs()
The fromFilePairs channel factory creates a channel emitting the file pairs matching a glob pattern provided by the user. The matching files are emitted as tuples, in which the first element is the grouping key of the matching pair and the second element is the list of files (sorted in lexicographical order).
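A minimal sketch, assuming paired files named like gut_1.fq and gut_2.fq in ./data/ggal (the pattern is inferred from the file names in the output shown next):

```groovy
// The * part of the pattern becomes the grouping key (e.g. gut, liver, lung);
// {1,2} matches the two files of each pair
Channel
    .fromFilePairs('./data/ggal/*_{1,2}.fq')
    .view()
```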
It will produce an output similar to the following:
[liver, [/workspace/gitpod/nf-training/data/ggal/liver_1.fq, /workspace/gitpod/nf-training/data/ggal/liver_2.fq]]
[gut, [/workspace/gitpod/nf-training/data/ggal/gut_1.fq, /workspace/gitpod/nf-training/data/ggal/gut_2.fq]]
[lung, [/workspace/gitpod/nf-training/data/ggal/lung_1.fq, /workspace/gitpod/nf-training/data/ggal/lung_2.fq]]
Warning
The glob pattern must contain at least one star wildcard character (*).
Name | Description |
---|---|
type | Type of paths returned, either file , dir or any (default: file ) |
hidden | When true includes hidden files in the resulting paths (default: false ) |
maxDepth | Maximum number of directory levels to visit (default: no limit ) |
followLinks | When true symbolic links are followed during directory tree traversal, otherwise they are managed as files (default: true ) |
size | Defines the number of files each emitted item is expected to hold (default: 2 ). Set to -1 for any |
flat | When true the matching files are produced as sole elements in the emitted tuples (default: false ) |
checkIfExists | When true throws an exception when the specified path does not exist in the file system (default: false ) |
Exercise
Use the fromFilePairs channel factory to create a channel emitting all pairs of FASTQ reads in the data/ggal/ directory and print them. Then use the flat: true option and compare the output with the previous execution.
5.2.6 fromSRA()
The Channel.fromSRA channel factory makes it possible to query the NCBI SRA archive and returns a channel emitting the FASTQ files matching the specified selection criteria.
The query can be project ID(s) or accession number(s) supported by the NCBI ESearch API.
Info
This function now requires an API key you can only get by logging into your NCBI account.
Instructions for NCBI login and key acquisition
- Go to: https://www.ncbi.nlm.nih.gov/
- Click the top right "Log in" button to sign into NCBI. Follow their instructions.
- Once in your account, click the button at the top right, usually your ID.
- Go to Account settings
- Scroll down to the API Key Management section.
- Click on "Create an API Key".
- The page will refresh and the key will be displayed where the button was. Copy your key.
For example, the following snippet will print the contents of an NCBI project ID:
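A sketch of such a query. The parameter name ncbi_api_key and the project ID SRP073307 are assumptions chosen for illustration; apiKey is the fromSRA option that carries the key:

```groovy
// Placeholder to be replaced with a real NCBI API key
params.ncbi_api_key = '<Your API key here>'

// Query a single project ID (assumed here) and print the matching FASTQ files
Channel
    .fromSRA(['SRP073307'], apiKey: params.ncbi_api_key)
    .view()
```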
Replace <Your API key here> with your API key.
This should print:
[SRR3383346, [/vol1/fastq/SRR338/006/SRR3383346/SRR3383346_1.fastq.gz, /vol1/fastq/SRR338/006/SRR3383346/SRR3383346_2.fastq.gz]]
[SRR3383347, [/vol1/fastq/SRR338/007/SRR3383347/SRR3383347_1.fastq.gz, /vol1/fastq/SRR338/007/SRR3383347/SRR3383347_2.fastq.gz]]
[SRR3383344, [/vol1/fastq/SRR338/004/SRR3383344/SRR3383344_1.fastq.gz, /vol1/fastq/SRR338/004/SRR3383344/SRR3383344_2.fastq.gz]]
[SRR3383345, [/vol1/fastq/SRR338/005/SRR3383345/SRR3383345_1.fastq.gz, /vol1/fastq/SRR338/005/SRR3383345/SRR3383345_2.fastq.gz]]
// (remaining omitted)
Multiple accession IDs can be specified using a list object:
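A sketch using the three accession numbers that appear in the output listed next. The variable and parameter names are illustrative:

```groovy
params.ncbi_api_key = '<Your API key here>'

// A list of accession IDs can be passed directly to fromSRA
ids = ['ERR908507', 'ERR908506', 'ERR908505']

Channel
    .fromSRA(ids, apiKey: params.ncbi_api_key)
    .view()
```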
[ERR908507, [/vol1/fastq/ERR908/ERR908507/ERR908507_1.fastq.gz, /vol1/fastq/ERR908/ERR908507/ERR908507_2.fastq.gz]]
[ERR908506, [/vol1/fastq/ERR908/ERR908506/ERR908506_1.fastq.gz, /vol1/fastq/ERR908/ERR908506/ERR908506_2.fastq.gz]]
[ERR908505, [/vol1/fastq/ERR908/ERR908505/ERR908505_1.fastq.gz, /vol1/fastq/ERR908/ERR908505/ERR908505_2.fastq.gz]]
Info
Read pairs are implicitly managed and are returned as a list of files.
It’s straightforward to use this channel as an input using the usual Nextflow syntax. The code below creates a channel containing two samples from a public SRA study and runs FASTQC on the resulting files. See:
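A sketch of such a workflow, under stated assumptions: the process name FASTQC, the parameter names, the two accession numbers, and the fastqc command-line options are illustrative and not confirmed by the text above.

```groovy
params.ncbi_api_key = '<Your API key here>'
params.accession = ['ERR908507', 'ERR908506']

process FASTQC {
    input:
    // fromSRA emits tuples of (sample ID, list of read files)
    tuple val(sample_id), path(reads_file)

    output:
    path "fastqc_${sample_id}_logs"

    script:
    """
    mkdir fastqc_${sample_id}_logs
    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads_file}
    """
}

workflow {
    reads = Channel.fromSRA(params.accession, apiKey: params.ncbi_api_key)
    FASTQC(reads)
}
```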
If you want to run the workflow above and do not have fastqc installed on your machine, don’t forget what you learned in the previous section. Run this workflow with -with-docker biocontainers/fastqc:v0.11.5, for example.
5.2.7 Text files
The splitText operator allows you to split multi-line strings or text file items emitted by a source channel into chunks containing n lines, which will be emitted by the resulting channel. See:
- Instructs Nextflow to make a channel from the path data/meta/random.txt.
- The splitText operator splits each item into chunks of one line by default.
- View contents of the channel.
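The three steps above can be sketched as:

```groovy
Channel
    .fromPath('data/meta/random.txt') // 1. channel from the file path
    .splitText()                      // 2. one chunk per line (default)
    .view()                           // 3. print each chunk
```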
You can define the number of lines in each chunk by using the parameter by, as shown in the following example:
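A minimal sketch, assuming chunks of two lines and a marker string printed after each chunk (both are illustrative choices):

```groovy
Channel
    .fromPath('data/meta/random.txt')
    .splitText(by: 2)                       // two lines per chunk
    .subscribe { chunk ->
        print chunk
        print "--- end of the chunk ---\n"  // marker to make chunk boundaries visible
    }
```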
Info
The subscribe operator permits execution of user defined functions each time a new value is emitted by the source channel.
An optional closure can be specified in order to transform the text chunks produced by the operator. The following example shows how to split text files into chunks of 10 lines and transform them into capital letters:
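A sketch of that transformation, using the same input file as the earlier examples:

```groovy
Channel
    .fromPath('data/meta/random.txt')
    // The closure receives each 10-line chunk and returns the transformed text
    .splitText(by: 10) { chunk -> chunk.toUpperCase() }
    .view()
```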
You can also make counts for each line:
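One way to sketch this is with a counter variable captured by the view closure. The variable name count is illustrative:

```groovy
def count = 0

Channel
    .fromPath('data/meta/random.txt')
    .splitText()
    // Prefix each line with a running counter; trim removes the trailing newline
    .view { line -> "${count++}: ${line.toUpperCase().trim()}" }
```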
Finally, you can also use the operator on plain files (outside of the channel context):
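A sketch of the same splitting applied to a file object directly, assuming the input file used in the earlier examples:

```groovy
def f = file('data/meta/random.txt')

// On a file object, splitText returns a list of lines rather than a channel
def lines = f.splitText()

def count = 0
for (String row : lines) {
    log.info "${count++} ${row.toUpperCase()}"
}
```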
5.2.8 Comma-separated values (.csv)
The splitCsv operator allows you to parse CSV-formatted text items emitted by a channel.
It then splits them into records or groups them as a list of records with a specified length.
In the simplest case, just apply the splitCsv operator to a channel emitting CSV-formatted text files or text entries. For example, to view only the first and fourth columns:
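A minimal sketch, assuming a CSV file at data/meta/patients_1.csv (the file name is an assumption):

```groovy
Channel
    .fromPath('data/meta/patients_1.csv')
    .splitCsv()
    // Without a header option, each row is a list indexed by position
    .view { row -> "${row[0]}, ${row[3]}" }
```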
When the CSV begins with a header line defining the column names, you can specify the parameter header: true which allows you to reference each value by its column name, as shown in the following example:
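A sketch using the same assumed file. The column names patient_id and num_samples are assumptions about its header line:

```groovy
Channel
    .fromPath('data/meta/patients_1.csv')
    .splitCsv(header: true)
    // With header: true, each row is a map keyed by column name
    .view { row -> "${row.patient_id}, ${row.num_samples}" }
```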
Alternatively, you can provide custom header names by specifying a list of strings in the header parameter as shown below:
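For instance, assuming a five-column file and hypothetical names col1 to col5:

```groovy
Channel
    .fromPath('data/meta/patients_1.csv')
    // Custom names are assigned to the columns in order
    .splitCsv(header: ['col1', 'col2', 'col3', 'col4', 'col5'])
    .view { row -> "${row.col1}, ${row.col4}" }
```

When custom names are given for a file that also contains its own header line, that first line is parsed as an ordinary record unless it is skipped.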
You can also process multiple CSV files at the same time:
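A sketch using a glob pattern instead of a single file name. The patients_*.csv naming and the column names are assumptions:

```groovy
Channel
    .fromPath('data/meta/patients_*.csv') // a pattern selects several files
    .splitCsv(header: true)
    // A tab is used here as the output delimiter
    .view { row -> "${row.patient_id}\t${row.num_samples}" }
```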
Tip
Notice that you can change the output format simply by adding a different delimiter.
Finally, you can also operate on CSV files outside the channel context:
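A sketch of parsing a file object directly, again assuming data/meta/patients_1.csv:

```groovy
def f = file('data/meta/patients_1.csv')

// On a file object, splitCsv returns a list of rows rather than a channel
def lines = f.splitCsv()

for (List row : lines) {
    log.info "${row[0]} -- ${row[2]}"
}
```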
Exercise
Try inputting fastq reads into the RNA-Seq workflow from earlier using .splitCsv.
Solution
Add a CSV text file containing the following, as an example input with the name "fastq.csv":
gut,/workspace/gitpod/nf-training/data/ggal/gut_1.fq,/workspace/gitpod/nf-training/data/ggal/gut_2.fq
Then replace the input channel for the reads in script7.nf, changing the lines that define it so that the input channel is built with the splitCsv operator:
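A sketch of that replacement, to be placed inside the workflow block. The original lines of script7.nf are not shown here, so the "before" part is an assumption, as is the channel name read_pairs_ch:

```groovy
// Before (assumed): reads collected as file pairs
// Channel
//     .fromFilePairs(params.reads, checkIfExists: true)
//     .set { read_pairs_ch }

// After: each CSV row provides the sample ID and the two read files
Channel
    .fromPath('fastq.csv')
    .splitCsv()
    .map { row -> [row[0], file(row[1]), file(row[2])] }
    .set { read_pairs_ch }
```

Converting the two path columns with file() is a design choice of this sketch, so that the processes receive file objects rather than plain strings.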
Finally, change the cardinality of the processes that use the input data. For example, update the input declaration of the quantification process to match the three-element rows of the CSV file:
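A sketch of the updated process, assuming it is named QUANTIFICATION and runs salmon as in the earlier RNA-Seq workflow. The process body is not shown in this section, so every name and command-line option here is an assumption:

```groovy
process QUANTIFICATION {
    tag "$sample_id"

    input:
    path salmon_index
    // Before (assumed): tuple val(sample_id), path(reads)
    tuple val(sample_id), path(reads1), path(reads2)

    output:
    path "$sample_id"

    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads1} -2 ${reads2} -o $sample_id
    """
}
```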
Repeat the above for the fastqc step.
Now the workflow should run from a CSV file.
5.2.9 Tab-separated values (.tsv)
Parsing TSV files works in a similar way. Simply add the sep: '\t' option to splitCsv:
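A minimal sketch, using the data/meta/regions.tsv file named in the exercise that follows:

```groovy
Channel
    .fromPath('data/meta/regions.tsv', checkIfExists: true)
    // sep sets the field separator; here a TAB character
    .splitCsv(sep: '\t')
    .view()
```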
Exercise
Try using the tab separation technique on the file data/meta/regions.tsv, but print just the first column, and remove the header.
5.3 More complex file formats
5.3.1 JSON
We can also easily parse the JSON file format using the splitJson channel operator.
The splitJson operator supports JSON arrays:
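A sketch with an illustrative JSON array supplied as a string (the array content is an assumption):

```groovy
Channel
    .of('["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]')
    // Each array element becomes a separate item
    .splitJson()
    .view { item -> "Item: ${item}" }
```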
JSON objects:
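A sketch with an illustrative JSON object (the content is an assumption). For an object, splitJson emits one item per top-level key:

```groovy
Channel
    .of('{"player": {"name": "Bob", "height": 180, "champion": false}}')
    .splitJson()
    .view { item -> "Item: ${item}" }
```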
And even a JSON array of JSON objects!
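A sketch combining both cases, again with illustrative content:

```groovy
Channel
    .of('[{"name": "Bob", "height": 180, "champion": false}, {"name": "Alice", "height": 170, "champion": false}]')
    // Each object in the array is emitted as a separate item
    .splitJson()
    .view { item -> "Item: ${item}" }
```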
Files containing JSON content can also be parsed:
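A sketch assuming a hypothetical file named file.json in the launch directory that contains a JSON array or object:

```groovy
Channel
    .fromPath('file.json') // hypothetical file name
    // The file content is parsed and split like the string examples
    .splitJson()
    .view { item -> "Item: ${item}" }
```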
5.3.2 YAML
This can also be used as a way to parse YAML files:
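A minimal sketch of one way to do this, assuming the records listed below are stored in data/meta/regions.yml (an assumed path) and that the SnakeYAML library is available on the Nextflow classpath:

```groovy
import org.yaml.snakeyaml.Yaml

def f = file('data/meta/regions.yml')

// Parse the file content; a YAML sequence of mappings becomes a list of maps
def records = new Yaml().load(f.text)

for (def entry : records) {
    log.info "${entry.patient_id} -- ${entry.feature}"
}
```

The input records and the resulting log lines are listed next.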
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R1
  feature: pass_vafqc_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R1
  feature: pass_stripy_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R1
  feature: pass_manual_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R1
  feature: other_region_selection_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R1
  feature: ace_information_gained
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R1
  feature: concordance_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R2
  feature: pass_vafqc_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R2
  feature: pass_stripy_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R2
  feature: pass_manual_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R2
  feature: other_region_selection_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R2
  feature: ace_information_gained
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R2
  feature: concordance_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R3
  feature: pass_vafqc_flag
  pass_flag: "TRUE"
- patient_id: ATX-TBL-001-GB-01-105
  region_id: R3
  feature: pass_stripy_flag
  pass_flag: "FALSE"
ATX-TBL-001-GB-01-105 -- pass_vafqc_flag
ATX-TBL-001-GB-01-105 -- pass_stripy_flag
ATX-TBL-001-GB-01-105 -- pass_manual_flag
ATX-TBL-001-GB-01-105 -- other_region_selection_flag
ATX-TBL-001-GB-01-105 -- ace_information_gained
ATX-TBL-001-GB-01-105 -- concordance_flag
ATX-TBL-001-GB-01-105 -- pass_vafqc_flag
ATX-TBL-001-GB-01-105 -- pass_stripy_flag
ATX-TBL-001-GB-01-105 -- pass_manual_flag
ATX-TBL-001-GB-01-105 -- other_region_selection_flag
ATX-TBL-001-GB-01-105 -- ace_information_gained
ATX-TBL-001-GB-01-105 -- concordance_flag
ATX-TBL-001-GB-01-105 -- pass_vafqc_flag
ATX-TBL-001-GB-01-105 -- pass_stripy_flag
5.3.3 Storage of parsers into modules
The best way to store parser scripts is to keep them in a Nextflow module file.
Let's say we don't have a JSON channel operator, but we create a function instead. The parsers.nf file should contain the parseJsonFile function. See the content below:
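A sketch of such a module, assuming it is saved as modules/parsers.nf (the directory is an assumption) and that Groovy's JsonSlurper is used for parsing:

```groovy
// modules/parsers.nf
import groovy.json.JsonSlurper

def parseJsonFile(json_file) {
    def f = file(json_file)
    // Parse the whole file content into Groovy lists and maps
    def records = new JsonSlurper().parseText(f.text)
    return records
}
```

The function can then be included in a workflow script. In the sketch below, the process name FOO, the input file pattern, and the record fields patient_id and feature are assumptions based on the output listed next:

```groovy
include { parseJsonFile } from './modules/parsers.nf'

process FOO {
    input:
    tuple val(patient_id), val(feature)

    output:
    stdout

    script:
    """
    echo $patient_id has $feature as feature
    """
}

workflow {
    Channel
        .fromPath('data/meta/regions*.json')
        .flatMap { json_file -> parseJsonFile(json_file) }       // one item per record
        .map { record -> [record.patient_id, record.feature] }
        .unique()                                                // drop repeated pairs
        .set { ch_input }

    FOO(ch_input).view()
}
```

Running the workflow prints one line per unique patient and feature pair, in no guaranteed order.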
ATX-TBL-001-GB-01-105 has pass_stripy_flag as feature
ATX-TBL-001-GB-01-105 has ace_information_gained as feature
ATX-TBL-001-GB-01-105 has concordance_flag as feature
ATX-TBL-001-GB-01-105 has pass_vafqc_flag as feature
ATX-TBL-001-GB-01-105 has pass_manual_flag as feature
ATX-TBL-001-GB-01-105 has other_region_selection_flag as feature
Nextflow will use this as a custom function within the workflow scope.
Tip
You will learn more about module files later in the Modularization section of this tutorial.